Hive / HBase migration - Different clusters

New Contributor

Hello guys,

I need to migrate some HBase and Hive structures from one Hadoop cluster (A) to another Hadoop cluster (B).

The first iteration must be a full data migration, and the following iterations must cover only new data inserted into the origin cluster (A); in other words, an incremental data migration.

Would DistCp be a good solution for this use case? What would you recommend?

Thanks guys!

1 ACCEPTED SOLUTION

@Thiago Charchar

HBase replication might not be the best way to synchronize the data in the initial phase of the migration. I would have recommended snapshots, but since you are upgrading to a higher version, that may not work either. Here is a multi-step approach to migrate your HBase data:

  1. Bulk-export the HBase tables to HDFS (a point-in-time approach).
  2. Use Hadoop DistCp to copy the exported sequence files to the remote cluster, where the HBase tables have already been created.
  3. Set up replication and let the tables catch up.
  4. Choose a date and time, and plan a staged cut-over of the applications.
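
Steps 1 and 2 can be sketched as command lines. The Python below is a minimal sketch that only builds the argument lists for the stock HBase `Export` and `Import` MapReduce jobs and for `hadoop distcp`; it does not run anything. Loading the copied files with `Import` is my reading of what step 2 implies, and the table name, paths, and NameNode hosts are hypothetical placeholders. The optional start and end timestamps (epoch milliseconds) are one way to limit a later export to newer cells.

```python
import shlex
from typing import List, Optional

EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"
IMPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Import"


def export_cmd(table: str, out_dir: str, versions: int = 1,
               start_ms: Optional[int] = None,
               end_ms: Optional[int] = None) -> List[str]:
    """Argv for: Export <table> <outputdir> [<versions> [<starttime> [<endtime>]]]"""
    cmd = ["hbase", EXPORT_CLASS, table, out_dir, str(versions)]
    if start_ms is not None:
        cmd.append(str(start_ms))
        if end_ms is not None:
            cmd.append(str(end_ms))
    elif end_ms is not None:
        # The arguments are positional, so an end time needs a start time.
        raise ValueError("end_ms requires start_ms")
    return cmd


def distcp_cmd(src_uri: str, dst_uri: str) -> List[str]:
    """Argv for copying the exported sequence files between clusters."""
    return ["hadoop", "distcp", src_uri, dst_uri]


def import_cmd(table: str, in_dir: str) -> List[str]:
    """Argv for loading the copied files into an existing table on cluster B."""
    return ["hbase", IMPORT_CLASS, table, in_dir]


if __name__ == "__main__":
    # Hypothetical table, export directory, and NameNode addresses.
    table, export_dir = "my_table", "/tmp/hbase-export/my_table"
    src = "hdfs://nn-a.example.com:8020" + export_dir
    dst = "hdfs://nn-b.example.com:8020" + export_dir
    for cmd in (export_cmd(table, export_dir),
                distcp_cmd(src, dst),
                import_cmd(table, export_dir)):
        print(shlex.join(cmd))
```

Run the export on cluster A and the import on cluster B. Whether export files written by the older HBase version load cleanly on the newer one should be checked on a small table first.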

Starting replication once most of your data has already been copied over puts far less stress on your cluster bandwidth, so you should be able to complete the migration while leaving bandwidth available for other operations.

As for migrating the "Hive structures": do you mean the metadata or the underlying data? For the underlying data, DistCp is the best option available. For the metadata, there are several options, and mapping the metastore to the new cluster is one of them.
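
For the incremental runs on the Hive data, DistCp's `-update` flag copies only files that are missing or different on the target. The sketch below builds such a command for one managed table. It assumes the usual `<warehouse>/<db>.db/<table>` layout and a warehouse root of `/apps/hive/warehouse`; both, along with the NameNode addresses and names, are assumptions to adjust for your clusters. The metadata still has to be migrated separately.

```python
import shlex
from typing import List


def hive_table_sync_cmd(src_nn: str, dst_nn: str, db: str, table: str,
                        warehouse: str = "/apps/hive/warehouse") -> List[str]:
    """Argv for an incremental DistCp of one managed Hive table's files.

    Tables in the 'default' database sit directly under the warehouse
    root; other databases get a '<db>.db' directory.
    """
    if db == "default":
        path = f"{warehouse}/{table}"
    else:
        path = f"{warehouse}/{db}.db/{table}"
    # -update skips files that already match on the target cluster.
    return ["hadoop", "distcp", "-update", src_nn + path, dst_nn + path]


if __name__ == "__main__":
    # Hypothetical NameNode addresses, database, and table.
    cmd = hive_table_sync_cmd("hdfs://nn-a.example.com:8020",
                              "hdfs://nn-b.example.com:8020",
                              "sales", "orders")
    print(shlex.join(cmd))
```

Partitions copied this way are not visible on cluster B until they are registered in its metastore.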

Let me know if this answer helped resolve your query.

3 REPLIES

@Thiago Charchar

For HBase - I'd suggest HBase replication, so that the data on the origin cluster stays in sync with the destination cluster.

For Hive - You can use Falcon to incrementally replicate the Hive tables.

New Contributor

Thanks, Sandeep!

One more question:

Do you see any issue with using HBase replication between Hadoop clusters running different versions (e.g. 2.2 -> 2.6)?

Thanks again!
