HDFS compression


We take backups of Hive by copying the physical files from the metastore path.

We store them in a backup directory in HDFS, which gets replicated to another site using Isilon.

Not sure how many know/use Isilon.

We use the distcp command below to copy within HDFS:

hadoop distcp -D mapreduce.output.fileoutputformat.compress=true -D <hive metastore path>/* <backup path within hdfs>

If I am not mistaken, the above command compresses the data during the transfer from one path to another, but the resulting files in the backup path are not compressed.

My first question is whether using distcp to copy within HDFS would be faster than using a copy command.

My second question is: what is the best way to compress the files after copying them to the backup directory?

Something similar to tar or gzip is what I am looking for.
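
To make the second question concrete, here is a rough sketch of the kind of gzip-style step I have in mind. It is only an illustration, not something we run today: the function name compress_hdfs_file and both paths are hypothetical, and it assumes the hdfs client and gzip are available on the node where it runs. It streams one file out of HDFS, compresses it locally, and writes the result back, so all data passes through a single machine instead of being processed in parallel.

```shell
# Hypothetical helper: gzip a single HDFS file by streaming it through
# the local gzip binary. Assumes `hdfs` and `gzip` are on the PATH.
compress_hdfs_file() {
  src="$1"   # HDFS path of the uncompressed file (placeholder)
  dst="$2"   # HDFS path for the gzip-compressed copy (placeholder)

  # `hdfs dfs -cat` writes the file to stdout; `hdfs dfs -put -` reads
  # the data to store from stdin. The source file is left in place.
  # Note: without `set -o pipefail` (bash), a failure in the first
  # stage of the pipeline is not reflected in the exit status.
  hdfs dfs -cat "$src" | gzip -c | hdfs dfs -put - "$dst"
}

# Example call (paths are placeholders):
#   compress_hdfs_file /backup/hive/some_file /backup/hive/some_file.gz
```

A loop over the backup directory would be needed to cover every file, and I do not know whether this is better or worse than a native Hadoop option, which is really what I am asking.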

Appreciate the insights.