We have a Hadoop cluster with ~1.5 PB of data (i.e. ~1500 TB), running on bare metal with CDH 5.7 and without Cloudera Manager. We're planning to decommission the cluster and set up a new one from scratch (bare metal as well, not cloud), this time with Cloudera Manager installed. We're also moving the whole datacenter that currently hosts it, so the new cluster will be in a different location. The idea is to keep all the data (all 1.5 PB is relevant, so unfortunately we can't get rid of anything). Just to clarify, we're talking about HDFS data as well as HBase databases/tables.
That being said, my question is:
Assuming we have our brand-new cluster set up and ready to ingest the data, what would be the best method to migrate all 1.5 PB to it? Needless to say, we need as little downtime as possible while doing this.
Below are our current cluster's resources:
Thanks in advance!
If you have Cloudera Manager on both the source and the target cluster, you can use its built-in replication:
Cloudera Manager -> Backup (menu) -> Peers, then Add Peer (one-time setup)
Cloudera Manager -> Backup (menu) -> Replication Schedules, then Create Schedule -> HDFS
This will transfer your data to the destination without any downtime. In your case, however, Cloudera Manager is not available on the source cluster, so I think distcp is the right option. (I am not sure about transferring the HBase data.)
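To make the distcp suggestion concrete, here is a minimal sketch in Python that only assembles a `hadoop distcp` command line. The NameNode hostnames, port, path, and the map and bandwidth values are placeholders I made up, not values from your cluster. The helper name `build_distcp_command` is also just for illustration. The `-update`, `-m`, and `-bandwidth` flags are standard distcp options, but please check them against the distcp version shipped with your CDH release before relying on this.

```python
import shlex


def build_distcp_command(src_uri, dst_uri, num_maps=50, bandwidth_mb=None, update=True):
    """Assemble (but do not run) a `hadoop distcp` command as a list of arguments.

    All default values here are illustrative assumptions, not tuned settings.
    """
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target,
        # so the same command can be re-run to pick up changes.
        cmd.append("-update")
    # -m caps the number of simultaneous map tasks doing the copy.
    cmd += ["-m", str(num_maps)]
    if bandwidth_mb is not None:
        # -bandwidth limits throughput per map task, in MB/s.
        cmd += ["-bandwidth", str(bandwidth_mb)]
    cmd += [src_uri, dst_uri]
    return cmd


if __name__ == "__main__":
    # Hypothetical NameNode addresses and path; replace with your own.
    cmd = build_distcp_command(
        "hdfs://old-nn.example.com:8020/data",
        "hdfs://new-nn.example.com:8020/data",
        num_maps=100,
        bandwidth_mb=50,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Because `-update` skips files that already match on the target, one possible approach is to run the bulk copy while the old cluster is still in service, then repeat the same command during a short cutover window to transfer only what changed. Whether that fits your downtime budget depends on how much data changes between runs and on the link between the two datacenters.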