If we were to implement an HDP cluster and move our data into it, could we re-use the data nodes should we decide to move to a different distribution? Or would we be forced to create new data nodes and migrate the data into them?
Hi @Ahmad Debbas,
It depends on which distro and which version you are using. As long as both sides run Apache HDFS at the same version, it will definitely work. For Apache HDFS at a different version, you need to perform an upgrade, and that procedure can differ between versions.
/Best regards, Mats
Your question is a little vague, but I assume you are asking whether you could later move to CDH. As @Mats Johansson mentioned, moving to another open-source distribution based on Apache Hadoop is fairly simple. In the case of CDH, let me flip your question around: how would you migrate from CDH to another distribution? Cloudera layers several proprietary components on top of Hadoop, and if you switch away, you are no longer licensed to use those components. That makes migration harder, though it is doable with custom scripts. With HDP, everything is free to use even if you are not a paying customer, so you can keep running the platform and all of its components while you migrate to another one.
In short, a simple distcp is all that is required to move data to another cluster. Can you use the same data nodes, i.e. install two distributions on the same hardware? I have done that, and in fact I am running such a setup now, but it depends on how much data you already have. If your data nodes have enough free space to hold both copies, it will work. Here is how I would do it if I did not have enough space:
1. Reduce replication factor to two.
2. In new cluster, keep replication factor to two.
3. After successful migration of data to new cluster, yank the first cluster.
4. Increase replication factor to three.
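The steps above can be sketched with standard HDFS commands. This is a hypothetical outline, not a tested runbook: the paths (`/data`) and NameNode addresses (`old-nn`, `new-nn`) are placeholders you would replace with your own, and you should verify the copy before decommissioning anything.

```shell
# 1. On the source cluster, reduce the replication factor to two
#    (recursively, waiting for re-replication to finish) to free space.
hdfs dfs -setrep -R -w 2 /data

# 2. On the new cluster, keep replication at two for the duration of the
#    migration, e.g. dfs.replication=2 in hdfs-site.xml, or per distcp job:
hadoop distcp -D dfs.replication=2 \
    hdfs://old-nn:8020/data hdfs://new-nn:8020/data

# 3. After verifying the data on the new cluster, decommission the old one.

# 4. Restore the replication factor to three on the new cluster.
hdfs dfs -setrep -R -w 3 /data
```

Note that distcp between clusters running different major HDFS versions typically needs the `webhdfs://` scheme on the source side instead of `hdfs://`, since the RPC protocols may not be compatible.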
All you need is enough space. This is doable, but much more difficult if you are on Cloudera and want to migrate to another distribution.