Hi there, I am curious whether I can avoid a distcp copy when moving data between clusters by attaching an existing instance running the Hadoop DataNode role to an entirely new cluster, with all of its HDFS data left in place.
What would happen if the new cluster were running a newer version of CDH?
These nodes also carry Impala, YARN, and HBase RegionServer roles; how would those react to being added to a new cluster?
Many thanks for any information I can use to make a case for or against doing this.
The DataNodes have no knowledge of the file or directory structure in HDFS. They store block files (more accurately, block replicas: each replica is one copy of a block, and the number of copies is set by the replication factor).
The NameNode holds the filesystem metadata and knows which DataNodes contain replicas of the blocks that make up each file. If you simply moved a DataNode to a new cluster, the new cluster's NameNode would have no record of that DataNode's blocks, so it would tell the DataNode to delete them.
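To make that consequence concrete, here is a toy Python sketch of the metadata split described above. This is not Hadoop source code and every name in it is invented for illustration; it only models the idea that the NameNode owns the file-to-block mapping while DataNodes hold bare block replicas, so a block report full of unknown replicas results in deletion orders.

```python
class NameNode:
    """Toy model: owns the namespace (file path -> block IDs)."""

    def __init__(self, namespace):
        self.namespace = namespace
        # Flatten the namespace into the set of blocks this cluster knows about.
        self.known_blocks = {b for blocks in namespace.values() for b in blocks}

    def process_block_report(self, reported_blocks):
        # Any replica the NameNode has no record of is invalid on this
        # cluster, so the DataNode is instructed to delete it.
        return sorted(b for b in reported_blocks if b not in self.known_blocks)


class DataNode:
    """Toy model: stores block replicas, knows nothing about files."""

    def __init__(self, blocks):
        self.blocks = set(blocks)


# A DataNode carrying block replicas written on the OLD cluster.
dn = DataNode({"blk_1001", "blk_1002", "blk_1003"})

# The NEW cluster's NameNode: its metadata has never heard of those blocks.
new_nn = NameNode({"/new/data.csv": ["blk_9001"]})

to_delete = new_nn.process_block_report(dn.blocks)
print(to_delete)  # every block brought over from the old cluster
```

Running the sketch, all three of the old cluster's blocks come back in the delete list, which is the forum answer in miniature: without the old NameNode's metadata, the data on the moved DataNode is just orphaned replicas.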
In any case, moving a node from one cluster to another is not straightforward: you would first have to decommission the node and delete its roles.