Created on 05-23-2018 09:28 PM - edited 09-16-2022 06:16 AM
Hello,
I have date-partitioned internal Hive tables and I want to back up year 2017 to external tapes (and then drop them). This is ~10TB of data.
What is the recommended way to copy this data out of HDFS?
Thank you,
Gerasimos
Created 05-24-2018 02:25 AM
There are different approaches. Let's first look at the available options and their pros & cons; then you can choose one, or combine them if needed.
1. Cloudera Manager -> Backup Menu option (or) distcp option
Pros:
a. Easy to take a backup
Cons:
a. It only works between two different clusters, so it may not be suitable for your requirement (a minimal distcp sketch is below)
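For reference, a minimal distcp sketch, assuming you had a second cluster to copy to (hostnames and paths below are only placeholders):
> hadoop distcp hdfs://nn-source:8020/user/hive/warehouse/mydb.db/mytable/year=2017 hdfs://nn-target:8020/backup/mytable/year=2017
Since you only have one cluster and want to end up on tape, this probably does not help you directly.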
2. Export/Import Option.
Step 1: Run the command below from the shell to export the working db.table to an HDFS path, then move it to local storage as needed. It exports both data & metadata.
> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"
Step 2: Run the import command below twice. The first import will throw an error because the table doesn't exist, but it will create the table; the second import will load the data as well.
> hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';"
Note: You can replace the $ variables with the actual path/file/table names.
Pros:
a. Export/Import takes care of both data & metadata, so you don't need to handle metadata separately
Cons:
a. I've used it a long time back for non-partitioned tables; I'm not sure how well it supports partitioned tables, so please double check (a partition-level sketch is below)
b. You need to run export/import for each table
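For reference, EXPORT/IMPORT also accepts a PARTITION clause, so you could try it per partition. A minimal sketch, where the db, table, partition column and target path are placeholders, so please verify on your Hive version:
> hive -S -e "export table mydb.mytable partition (year='2017') to '/backup/mytable_year2017';"
> hive -S -e "import table mydb.mytable partition (year='2017') from '/backup/mytable_year2017';"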
3. Move the HDFS data to the local file system, then from local to tape, and take the metadata backup separately. Ex: MySQL, there are many guides available online about how to take a MySQL backup (a command-level sketch is below)
Pros:
a. A metadata backup is possible for the entire db
Cons:
a. You may need to copy the HDFS data file by file, depending on your local FS capacity
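A rough command-level sketch of option 3, assuming the metastore is on MySQL and using placeholder paths, db and user names:
> hdfs dfs -get /user/hive/warehouse/mydb.db/mytable/year=2017 /data/staging/mytable_2017
> mysqldump -u hive -p metastore > /data/staging/hive_metastore_backup.sql
From /data/staging you can then write to tape with whatever tool your tape library uses.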
There could be other options too; please post below if you or anyone else finds something.
Created 05-24-2018 06:59 AM
Thank you @saranvisa
Option 1 is not feasible.
Option 2 exports to HDFS. I do not have that much space left (10TB) for the export, and I would also need to move it from HDFS to the tapes.
Option 3: what is 'local' in that case?
I assume that I need a workflow that will copy (that amount of) data from Hive (HDFS) to a remote location or a mounted folder.
Created 05-30-2018 03:40 AM
Option 3: by 'local' I meant the Linux file system.
Also, I just want to make sure: is the 10TB you mentioned before replication or after replication? (A quick way to check is sketched below.)
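If you want to double check, something like the below should show it (the path is a placeholder; depending on your Hadoop release, -du prints either just the raw size or both the raw size and the space consumed after replication):
> hdfs dfs -du -s -h /user/hive/warehouse/mydb.db/mytable
If only one size is shown, multiply it by your replication factor (usually 3) to get the on-disk footprint.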
Created 05-30-2018 03:46 AM
Thank you @saranvisa
10TB is the data to move (and then delete to free up space).
2TB is the space currently left on each node.
Created 08-22-2018 06:42 AM
Dear @saranvisa
Concerning your suggestion to export the table:
> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"
I did export a partition of the table, and I saw that I got:
1. _metadata
2. a copy of the directory structure as in /user/hive/warehouse/myTable/thePartition
Since item [2] is too time consuming and duplicates the partition's data, is there another way to export only the _metadata and copy the files directly from the Hive warehouse folder? Or am I missing something about the export command?
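For reference, what I have in mind is something like the below (the warehouse path follows my example above, /data/staging is a placeholder, and whether the exported _metadata can be replaced this way is exactly my question):
> hive -S -e "show create table myTable;" > myTable_ddl.sql
> hdfs dfs -get /user/hive/warehouse/myTable/thePartition /data/staging/thePartition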