05-23-2018 09:28 PM
I have date-partitioned internal Hive tables and I want to backup year 2017 to external tapes (and then drop them). This is ~10TB of data.
What is the recommended way to copy this data out of HDFS?
05-24-2018 02:25 AM
There are different approaches. Let's look at the available options with their pros & cons; then you can choose one, or combine them as needed
1. Cloudera Manager -> Backup menu option (or) distcp option
a. Easy to take a backup
b. It works between two different clusters, so it may not be suitable for your requirement
2. Export/Import Option.
Step 1: Run the command below to export each db.table to an HDFS path, then move it to local storage as needed. It will export both data & metadata
> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"
Step 2: Run the import command below twice. The first import will throw an error because the table doesn't exist, and it will create the table; the second import will import the data too
> hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';"
Note: You can hard-code the $ variables with actual path/file/table names
a. Export/Import takes care of both data & metadata, so you don't need to handle metadata separately
b. I used it a long time ago for non-partitioned tables; I'm not sure how well it supports partitioned tables, please double check
c. You need to run export/import for each table
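The per-table repetition in option 2 is easy to script. Below is a minimal dry-run sketch: it only prints the hive commands it would execute (drop the echo to actually run them), and the schema, table names, and HDFS path are hypothetical placeholders you would replace with your own:

```shell
#!/bin/sh
# Hypothetical values -- substitute your own schema, tables, and HDFS path.
SCHEMA="mydb"
TABLES="sales_2017 clicks_2017"
HDFS_DATA_PATH="/backup/export_2017"

for tbl in $TABLES; do
  # Step 1: export data + metadata for this table into HDFS.
  echo "hive -S -e \"export table ${SCHEMA}.${tbl} to '${HDFS_DATA_PATH}/${tbl}';\""
  # Step 2: on the destination, run the import twice -- the first run
  # creates the table (after an initial error), the second loads the data.
  echo "hive -S -e \"import table ${SCHEMA}.${tbl} from '${HDFS_DATA_PATH}/${tbl}';\""
  echo "hive -S -e \"import table ${SCHEMA}.${tbl} from '${HDFS_DATA_PATH}/${tbl}';\""
done
```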
3. Move the HDFS data to local storage, move it from local to tape, and take the metadata backup separately. Ex: MySQL - many guides are available online about how to take a MySQL backup
a. A metadata backup is possible for the entire db
b. You may need to move HDFS files one by one, depending on your local FS capacity
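Option 3 could be sketched like this. It is a dry run (it prints the commands instead of executing them), all paths and database names are hypothetical placeholders, and it assumes a MySQL-backed metastore and a tape device at /dev/st0:

```shell
#!/bin/sh
# Dry run: prints the commands rather than executing them.
# All paths and DB names below are hypothetical placeholders.
HDFS_SRC="/warehouse/mydb/sales/year=2017"
LOCAL_STAGE="/mnt/tape_stage"   # needs room only for one staged partition
METASTORE_DB="metastore"

# Copy one partition at a time so the local FS never holds all 10TB at once.
echo "hdfs dfs -copyToLocal ${HDFS_SRC} ${LOCAL_STAGE}/year=2017"

# Back up the Hive metastore (MySQL in this example) separately.
echo "mysqldump ${METASTORE_DB} > ${LOCAL_STAGE}/metastore_backup.sql"

# Write the staged files to tape, then clear the stage for the next partition.
echo "tar -cf /dev/st0 -C ${LOCAL_STAGE} ."
```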
There could be other options too; please reply below if you/anyone find something else
05-24-2018 06:59 AM
Thank you @saranvisa
Option 1 is not feasible.
Option 2 exports to HDFS. I don't have that much space (10TB) left to export into, and I would still need to move the export from HDFS to the tapes.
Option 3: what is 'local' in that case?
I assume that I need a workflow that will copy (that amount of) data from Hive (HDFS) to a remote location or a mounted folder.
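One way to realize that workflow without staging anything extra inside HDFS is to copy each partition straight to the mount. A dry-run sketch (it only prints the commands; the warehouse path, table, partition names, and mount point are hypothetical, and it assumes the mount is visible from the node running the script):

```shell
#!/bin/sh
# Dry run: prints commands instead of executing them.
# Paths below are hypothetical placeholders.
WAREHOUSE="/warehouse/mydb"
MOUNT="/mnt/tape_mount"   # NFS or other mount on the machine running this script

# Copy each 2017 partition of one table straight from HDFS to the mount,
# one at a time, so only one partition is in flight at any moment.
for part in year=2017/month=01 year=2017/month=02; do
  echo "hdfs dfs -copyToLocal ${WAREHOUSE}/sales/${part} ${MOUNT}/sales/${part}"
done
```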
05-25-2018 11:23 PM