We have a 7-node cluster (2 master and 5 slave nodes) on AWS EC2, and we need to do some performance benchmarking using various instance types. What would be the best strategy to replace the existing 5 slave nodes with a new set of slave nodes of a different instance type? Since we have some data in HDFS, what would be the best strategy to retain the data and bring up the cluster with the new nodes?
One of our thoughts is to bring up the new nodes alongside the existing slave nodes and then bring the old slave nodes down one by one. Any thoughts would be appreciated.
You can add the new nodes and decommission the old EC2 instances at a steady pace. Considering your cluster size, I would recommend decommissioning just 1 node at any given time. On a side note, if you are planning for performance benchmarking, you could always add new nodes and create configuration groups (leveraging the new memory/resource configs) via Ambari to test how the new nodes are performing. This might save you from having to replace all the nodes of the cluster.
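As a rough sketch of the one-node-at-a-time approach, the Python below only builds the shell commands for each round; it does not run them. It assumes the NameNode reads an exclude file configured through `dfs.hosts.exclude`. The hostnames and the exclude-file path are hypothetical, and on an Ambari-managed cluster you would normally trigger the decommission from Ambari instead of editing the file by hand.

```python
from typing import List


def decommission_rounds(old_nodes: List[str], exclude_file: str) -> List[List[str]]:
    """Build the shell commands for retiring the old DataNodes one per round.

    The exclude file is cumulative: nodes retired in earlier rounds stay
    listed so the NameNode keeps treating them as excluded.
    """
    rounds = []
    for i in range(len(old_nodes)):
        excluded = old_nodes[: i + 1]
        rounds.append([
            # Rewrite the exclude file with every node retired so far.
            "printf '%s\\n' " + " ".join(excluded) + " > " + exclude_file,
            # Ask the NameNode to re-read its include/exclude files.
            "hdfs dfsadmin -refreshNodes",
            # Check per-node status; wait for "Decommissioned" before the next round.
            "hdfs dfsadmin -report",
        ])
    return rounds


if __name__ == "__main__":
    # Hypothetical hostnames and path, for illustration only.
    old = ["old-dn-1.example.com", "old-dn-2.example.com"]
    for number, commands in enumerate(decommission_rounds(old, "/etc/hadoop/conf/dfs.exclude"), 1):
        print(f"Round {number}:")
        for command in commands:
            print("  " + command)
```

Running one round, waiting for the node to finish decommissioning, and only then starting the next keeps the re-replication load small on a 5-datanode cluster.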
Hope this helps.
Hi! The first question is: where are you storing the HDFS data? Is it in ephemeral (instance) store or on EBS? If it is in ephemeral store, the data disappears when you stop the datanode's EC2 instance. EBS volumes persist across stops and starts.
My first recommendation is to add the test datanodes to the cluster. You can then decommission the old nodes, which forces their data over to the new nodes, and do your testing. Once the data is completely moved over you can delete the old nodes.
When you are ready to test the next set of datanodes, repeat the above steps: add the new nodes, decommission the old nodes, and delete them once the data is moved over.
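To decide when an old node is safe to delete, one option is to check its status in the output of `hdfs dfsadmin -report`. The sketch below parses that text in Python. The sample report is an illustrative, trimmed approximation of the layout with made-up hostnames, so check the patterns against the real output of your Hadoop version before relying on them.

```python
import re
from typing import Dict

# Illustrative sample only: hostnames are hypothetical and most fields are omitted.
SAMPLE_REPORT = """\
Name: 10.0.0.11:50010 (old-dn-1.example.com)
Hostname: old-dn-1.example.com
Decommission Status : Decommission in progress

Name: 10.0.0.21:50010 (new-dn-1.example.com)
Hostname: new-dn-1.example.com
Decommission Status : Normal
"""


def decommission_status(report: str) -> Dict[str, str]:
    """Map each datanode hostname in the report to its decommission status."""
    status: Dict[str, str] = {}
    host = None
    for line in report.splitlines():
        match = re.match(r"\s*Hostname:\s*(\S+)", line)
        if match:
            host = match.group(1)
            continue
        match = re.match(r"\s*Decommission Status\s*:\s*(.+?)\s*$", line)
        if match and host is not None:
            status[host] = match.group(1)
    return status


def safe_to_remove(report: str, host: str) -> bool:
    """True only once the node reports that decommissioning has finished."""
    return decommission_status(report).get(host) == "Decommissioned"


if __name__ == "__main__":
    for name, state in decommission_status(SAMPLE_REPORT).items():
        print(name, "->", state)
```

A node that is missing from the report is treated as not safe to remove, which errs on the side of keeping data.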
If you are using EBS you can unmount and remount the drives on new datanodes. You have to be careful about UIDs, GIDs and ownership. It is pretty messy and not as simple as my first choice.
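Before starting a datanode on a remounted volume, it can help to confirm that every file is owned by the UID and GID the new host expects. The following is a minimal Python sketch of such a check. The mount point in the usage note and the `hdfs` account name are assumptions, so substitute whatever user runs the DataNode on your cluster.

```python
import os
from typing import List


def ownership_mismatches(root: str, uid: int, gid: int) -> List[str]:
    """Return every path under root (root included) not owned by uid:gid."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    mismatched = []
    for path in paths:
        info = os.lstat(path)  # lstat: report on a symlink itself, not its target
        if info.st_uid != uid or info.st_gid != gid:
            mismatched.append(path)
    return mismatched
```

A possible use on the new datanode, with a hypothetical mount point: look up the account with `pwd.getpwnam("hdfs")`, then call `ownership_mismatches("/grid/0/hadoop/hdfs/data", account.pw_uid, account.pw_gid)` and fix any paths it returns before starting the DataNode.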
A third option is to copy all your data to S3 and then pull it back for each test cluster. This is my second-choice method.
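One common way to do this copy is `hadoop distcp` with an `s3a://` destination. The Python sketch below just assembles the two commands (backup and restore) so they can be reviewed before being run. The bucket name, prefix, and HDFS path are hypothetical, and it assumes the cluster's S3A connector and AWS credentials are already configured.

```python
from typing import List, Tuple


def distcp_command(src: str, dst: str) -> List[str]:
    """Argument list for a DistCp copy from src to dst."""
    return ["hadoop", "distcp", src, dst]


def backup_and_restore(hdfs_path: str, bucket: str, prefix: str) -> Tuple[List[str], List[str]]:
    """Return (backup, restore) commands for copying hdfs_path to S3 and back."""
    s3_uri = f"s3a://{bucket}/{prefix.strip('/')}"
    return distcp_command(hdfs_path, s3_uri), distcp_command(s3_uri, hdfs_path)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    backup, restore = backup_and_restore("hdfs:///benchmark/data", "my-benchmark-bucket", "hdfs-backup")
    print(" ".join(backup))
    print(" ".join(restore))
```

The lists can be passed to `subprocess.run(command, check=True)` on a cluster node once the paths look right.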
How are you building your cluster? Are you using CloudFormation, Cloudbreak or Hortonworks Data Cloud? Hortonworks Data Cloud (HDC) was just released last week and would be a good choice to spin up different clusters for your testing. If you use HDC, you should store your data in S3 and pull it back for each cluster.
Let me know if you have any questions.