Backup Hive tables to external tapes

Rising Star

Hello,

 

I have date-partitioned internal Hive tables and I want to back up year 2017 to external tapes (and then drop those partitions). This is ~10TB of data.

 

What is the recommended way to copy this data out of HDFS?

 

Thank you,

Gerasimos

6 REPLIES

Champion

@gerasimos

 

There are different approaches. First, let's look at the available options with their pros & cons; then you can choose one (or), if possible, combine them as needed.


1. Cloudera Manager -> Backup menu option (or) the distcp option (a minimal sketch follows below)

Pros:
a. Easy to take a backup

Cons:
a. It works between two different clusters, so it may not be suitable for your requirement
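For reference, a minimal distcp sketch, assuming a second cluster is available as the copy target (the NameNode hosts, database/table, and paths below are placeholders, not from this thread):

# copy the year-2017 data of one table to another cluster (hosts/paths are examples)
> hadoop distcp hdfs://nn1.example.com:8020/user/hive/warehouse/mydb.db/mytable/year=2017 \
      hdfs://nn2.example.com:8020/backup/mytable/year=2017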


2. Export/Import option (a partition-level sketch follows after the cons below).

Step 1: Execute the command below to export the working db.table to an HDFS path, then move it to local as needed. It exports both data & metadata.
> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"


Step 2: Run the import command below twice. The first import will throw an error because the table doesn't exist, but it will create the table; the second import will import the data too.

> hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';"

Note: You can replace the $ variables with the actual path/file/table names.


Pros:
a. Export/Import takes care of both data & metadata, so you don't need to handle metadata separately

Cons:
a. I've used it a while back for non-partitioned tables; I'm not sure how well it supports partitioned tables, please double check
b. Need to apply export/import for each table
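As a rough sketch of a partition-level export/import (the database, table, partition column, and staging path below are made-up examples; Hive's EXPORT/IMPORT statements accept a PARTITION clause, but please verify the behavior on your version):

# export only the year-2017 partition (data + _metadata) to an HDFS staging path
> hive -S -e "export table mydb.mytable partition (year='2017') to '/backup/mytable_2017';"

# later, restore that partition from the staging path
> hive -S -e "import table mydb.mytable partition (year='2017') from '/backup/mytable_2017';"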

 

3. Move the HDFS data to the local file system and from local to tape, and take the metadata backup separately. Ex: MySQL - there are many guides available online about how to take a MySQL backup. (A minimal sketch follows after the cons below.)

 

Pros:

a. A metadata backup is possible for the entire db

 

Cons:

a. May need to move the HDFS data file by file, depending on your local FS capacity
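A minimal sketch of option 3, assuming a MySQL-backed metastore database named 'metastore', a local staging directory, and a tape device at /dev/st0 (all of these names/paths are assumptions for illustration):

# 1. pull the year-2017 data from HDFS to the local file system, in chunks that fit the local disk
> hdfs dfs -get /user/hive/warehouse/mydb.db/mytable/year=2017 /data/staging/mytable_2017

# 2. back up the Hive metastore database separately
> mysqldump -u root -p metastore > /data/staging/metastore_backup.sql

# 3. write the staged data and the metadata dump to tape
> tar -cvf /dev/st0 /data/staging/mytable_2017 /data/staging/metastore_backup.sql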

 

 

There could be other options too; please reply below if you/anyone finds something.

 

Rising Star

Thank you @saranvisa

 

Option 1 is not feasible.

 

Option 2 exports to HDFS. I do not have that much space left (10TB) for the export, and I would also need to move it from HDFS to the tapes.

 

Option 3: what is 'local' in that case?

 

I assume that I need a workflow that will copy (that amount of) data from Hive (HDFS) to a remote location or a mounted folder.

Champion

@gerasimos

 

Option 3: by 'local' I meant the Linux file system.

 

Also, just to make sure: is the 10TB that you mentioned before replication (or) after replication?
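(One way to check, as a hedged example with a placeholder path: on recent Hadoop versions, hdfs dfs -du reports both the raw size and the space consumed including replicas.)

> hdfs dfs -du -s -h /user/hive/warehouse/mydb.db/mytable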

Rising Star

Thank you @saranvisa

 

10TB is the data to move (and delete to free space)

 

2TB is the space currently left on each node.

Rising Star

Dear @saranvisa

 

Concerning your suggestion to export the table:

 

> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"

 

I did export a partition of the table, and I saw that I got:

 

1. _metadata

2. a copy of the directory structure as in /user/hive/warehouse/myTable/thePartition

 

Since item [2] is very time-consuming and duplicates the partition's data, is there another way to export only the _metadata and copy the files directly from the Hive warehouse folder? Or am I missing something about the export command?

Champion
Would you consider converting them to HAR files?
Below is a reference for further reading:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving
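A minimal sketch of what archiving a partition looks like, using the settings and command documented on that page (the database/table/partition names are examples; note that archiving reduces the file count but does not move data off the cluster):

# settings and the ALTER TABLE ... ARCHIVE syntax are from the wiki page above; names are placeholders
> hive -S -e "set hive.archive.enabled=true;
              set hive.archive.har.parentdir.settable=true;
              set har.partfile.size=1099511627776;
              alter table mydb.mytable archive partition (year='2017');"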