Created 04-24-2018 12:56 PM
Hello guys,
I need to migrate some HBase and Hive structures from one Hadoop cluster (A) to another Hadoop cluster (B).
The first iteration must be a full data migration, and the following iterations must cover only new data inserted in the origin cluster (A); in other words, an incremental data migration.
Would DistCp be a good solution for this use case? What would you recommend?
Thanks guys!
Created 04-25-2018 08:37 PM
HBase replication might not be the best approach for synchronizing the data in the initial phase of the migration. I would have recommended snapshots, but since you are upgrading to a higher version, that may not work either. Instead, follow a multi-step approach to migrate your HBase data over.
Enabling replication once the majority of your data has been copied over will put far less stress on your cluster bandwidth, and you should be able to complete the migration while leaving bandwidth available for other operations.
As far as the migration of "Hive structures" is concerned, do you mean the metadata or the underlying data? If you are talking about the underlying data, distcp is of course the best option available. For metadata migration, there are multiple options, and mapping the metastore to the new cluster is one of them.
Let me know if this answer helped resolve your query.
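To make the distcp suggestion concrete, here is a minimal Python sketch of a full first pass followed by incremental passes. It only builds the command lines and does not run them. The NameNode addresses and the warehouse path are hypothetical placeholders, and the helper name `build_distcp_command` is made up for illustration; `-update` and `-delete` are standard distcp options.

```python
def build_distcp_command(src_uri, dst_uri, incremental=False, delete_missing=False):
    """Return the argument list for a `hadoop distcp` run; nothing is executed."""
    cmd = ["hadoop", "distcp"]
    if incremental:
        # -update copies only files that are missing or differ at the target
        cmd.append("-update")
        if delete_missing:
            # -delete removes target files that no longer exist at the source
            cmd.append("-delete")
    cmd += [src_uri, dst_uri]
    return cmd


# Hypothetical NameNode addresses and table path (assumptions, not from the thread)
SRC = "hdfs://nn-a:8020/apps/hive/warehouse/sales.db"
DST = "hdfs://nn-b:8020/apps/hive/warehouse/sales.db"

full_copy = build_distcp_command(SRC, DST)
incremental_copy = build_distcp_command(SRC, DST, incremental=True)

print(" ".join(full_copy))
print(" ".join(incremental_copy))
```

Note that `-update` also changes how the source directory is laid out under the target, so it is worth checking the resulting paths on a small test directory before running against the real warehouse.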
Created 04-24-2018 03:04 PM
For HBase - I'd suggest HBase replication, so that the data in the origin cluster stays in sync with the destination cluster.
For Hive - You can use Falcon to incrementally replicate the Hive tables.
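For the HBase replication suggestion, a rough Python sketch of the HBase shell statements involved might look like the following. It only generates text to paste into `hbase shell` on the origin cluster. The peer id, ZooKeeper quorum, znode, table, and column family names are all hypothetical, and the exact `add_peer` syntax differs between HBase versions (older releases take the cluster key as a plain string), so treat this as an assumption to verify against your version.

```python
def replication_shell_script(peer_id, cluster_key, tables):
    """Build HBase shell statements that register a peer and mark
    column families for replication. Returns the script as a string."""
    # cluster_key format: <zk quorum>:<zk client port>:<znode parent>
    lines = [f"add_peer '{peer_id}', CLUSTER_KEY => \"{cluster_key}\""]
    for table, families in tables.items():
        for cf in families:
            # REPLICATION_SCOPE 1 marks the family's edits for shipping to peers
            lines.append(
                f"alter '{table}', {{NAME => '{cf}', REPLICATION_SCOPE => '1'}}"
            )
    return "\n".join(lines)


# Hypothetical destination cluster key and table layout
script = replication_shell_script(
    peer_id="1",
    cluster_key="zk-b1,zk-b2,zk-b3:2181:/hbase",
    tables={"orders": ["d"]},
)
print(script)
```

Replication only ships edits made after it is enabled, so existing rows still need a separate bulk copy.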
Created 04-25-2018 02:59 PM
Tks Sandeep!
One more question:
Do you see any issue in using HBase replication between Hadoop clusters running different versions (e.g. 2.2 -> 2.6)?
Tks again!