Created 08-23-2016 01:29 AM
What are the different options to transfer data from an old cluster to a new one (HDFS/Hive/HBase)?
Created 08-23-2016 02:26 AM
If you have the same components on the target cluster, the following approaches work well.
1) DistCp is one of the best options for transferring HDFS data between clusters (see the sketch after this list).
2) For HBase, follow the links below; the reference guide describes two methods:
http://hbase.apache.org/book.html#ops.backup :- using DistCp for a full dump
https://hbase.apache.org/book.html#ops.snapshots :- snapshots for HBase tables
3) For the Hive metastore, check which database type it uses (e.g. MySQL) and take a full export of that database as below.
For a full dump, you can use the "root" user:
mysqldump -u [username] -p[password] [dbname] > filename.sql
And if you wish to gzip it at the same time:
mysqldump -u [username] -p[password] [dbname] | gzip > filename.sql.gz
You can then move this file between servers with:
scp user@xxx.xxx.xxx.xxx:/path_to_your_dump/filename.sql.gz your_destination_path/
Once copied, import all objects into the MySQL database on the new cluster and start the Hive server (see the import sketch below).
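For reference, a typical DistCp invocation for step 1 might look like the following; the NameNode hosts and the /data path are placeholders to replace with your own:
# Copy /data from the old cluster to the new one; -update skips files
# already present with the same size and checksum, -p preserves attributes
hadoop distcp -update -p hdfs://old-nn.example.com:8020/data hdfs://new-nn.example.com:8020/data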
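For step 2, the snapshot method is usually the less disruptive route; a minimal sketch, with the table and snapshot names as placeholders:
# On the old cluster: take a snapshot of the table
echo "snapshot 'mytable', 'mytable_snap'" | hbase shell
# Export the snapshot to the new cluster's HBase root directory
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mytable_snap -copy-to hdfs://new-nn.example.com:8020/hbase -mappers 16
# On the new cluster: materialize a table from the snapshot
echo "clone_snapshot 'mytable_snap', 'mytable'" | hbase shell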
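And for the import in step 3, a minimal sketch, assuming the metastore database is named "hive" (adjust to your setup):
# On the new metastore host: restore the dump, then restart Hive
gunzip filename.sql.gz
mysql -u root -p hive < filename.sql
Note that if the new cluster's NameNode URI differs, the HDFS locations stored in the metastore may need updating as well.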
Created 08-23-2016 02:08 PM
Apache Falcon is specifically designed to replicate data between clusters. It is another tool in your toolbox in addition to the suggestions provided by @zkfs.
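For completeness, a Falcon replication is defined as a feed entity with a source and a target cluster and then driven from the CLI; a rough sketch (the entity file names and the feed name are placeholders):
# Register both clusters, then submit and schedule the replication feed
falcon entity -type cluster -submit -file old-cluster.xml
falcon entity -type cluster -submit -file new-cluster.xml
falcon entity -type feed -submit -file replication-feed.xml
falcon entity -type feed -schedule -name replication-feed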
Created 08-24-2016 04:06 AM
I will not repeat the content of the responses from @zkfs and @Michael Young. The above responses are great, but they are not exclusive, just complementary; my 2c: Falcon will help with HDFS, but it won't help with HBase. I would use Falcon for active-active clusters or disaster recovery. Your question implies that data is migrated from an old cluster to a new cluster, so you could go with the options from @zkfs. Falcon is also an option for the HDFS part, but the effort to set it up and administer it is worth it only for continuous replication, not a one-time deal. For that case, HBase replication should also be considered; it was not mentioned in the above responses (see the sketch below).
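To illustrate the HBase replication option, a minimal sketch using the hbase shell; the peer id, ZooKeeper quorum, table, and column family are placeholders:
# On the source cluster: register the target cluster as a replication peer
echo "add_peer '1', CLUSTER_KEY => 'zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase'" | hbase shell
# Enable replication on the column families that should be shipped
echo "alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => 1}" | hbase shell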
Created 09-01-2016 10:13 PM
We're going through this process now, migrating a non-trivial amount of data from an older cluster onto a new cluster and environment. We have a couple of requirements and constraints that limited some of the options:
After trying a number of tools and combinations, including WebHDFS and Knox, we've settled on the following:
As part of managing the migration process, we're also making use of HDFS snapshots on both source and target to enable consistency management. Our migration jobs take snapshots at the beginning and end of each migration job and issue delta or difference reports to identify whether data was modified and possibly missed during the migration process. I'm expecting that some of our larger data sets will take hours to complete; for the largest few, possibly more than 24 hours. In order to perform the snapshot management we also added some additional wrapper code, since WebHDFS can be used to create and list snapshots but doesn't yet have an operation for returning a snapshot difference report (see the command sketch below).
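For reference, the snapshot bookkeeping described above maps onto the standard HDFS commands; a sketch with placeholder paths and snapshot names:
# Allow snapshots on the directory (once, as an HDFS administrator)
hdfs dfsadmin -allowSnapshot /data/mydataset
# Snapshot before and after the migration job
hdfs dfs -createSnapshot /data/mydataset s_before
# ... run the migration job ...
hdfs dfs -createSnapshot /data/mydataset s_after
# Report paths created, deleted, or modified between the two snapshots
hdfs snapshotDiff /data/mydataset s_before s_after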
For the Hive metadata, the majority of our Hive DDL exists in git/source control. We're actually using this migration as an opportunity to enforce this for our production objects. For end-user objects, e.g. analysts' data labs, we're exporting the DDL on the old cluster and replaying it on the new cluster, with tweaks for any reserved-word collisions (a sketch follows below).
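As an illustration of the DDL export/replay, one way to pull the end-user DDL off the old cluster; the database name is a placeholder and this assumes the Hive CLI is available:
# Dump the CREATE statement for every table in a database
for t in $(hive -e "USE userdb; SHOW TABLES;"); do
  hive -e "USE userdb; SHOW CREATE TABLE $t;" >> userdb_ddl.sql
  echo ";" >> userdb_ddl.sql
done
# Replay on the new cluster after fixing any reserved-word collisions
hive -f userdb_ddl.sql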
We don't have HBase operating on our old cluster so I didn't have to come up with a solution for that problem.