
Compression/Zipping of old unused HDFS files in Production Cluster

Explorer

Hey Hadoopers,

 

We are using CDH and are planning to free up some space on HDFS for future incoming data. We were wondering whether we could compress/zip old HDFS data to reclaim some of that space.

 

1. Is compression or zipping of HDFS files possible?

2. Can we compress the data and store it on the local file system or on HDFS?

3. If yes, how do we uncompress the data later and restore it to the same HDFS path with the exact metadata it had before?

4. Will the compressed output preserve the original file path and metadata?

 

Thanks a ton!

 

 

1 REPLY

Contributor

Hello there,

I understand your use case of saving some HDFS space. I haven't tested zipping HDFS-level files myself; the data compression options CDH supports are covered in [2].
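If you do want to experiment with compressing files that already sit on HDFS, here is a minimal, untested sketch using Hadoop's FileSystem and CompressionCodec APIs. The /data/old and /archive paths and the choice of gzip are only placeholders, not something specific to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths: the source file to archive and where to write the compressed copy.
        Path source = new Path("/data/old/part-00000");
        Path target = new Path("/archive/part-00000.gz");

        // Instantiate a gzip codec; other codecs (Bzip2, Snappy) follow the same pattern.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        try (FSDataInputStream in = fs.open(source);
             CompressionOutputStream out = codec.createOutputStream(fs.create(target))) {
            // Stream the uncompressed bytes through the codec back onto HDFS.
            IOUtils.copyBytes(in, out, conf);
        }
        // The original file is left in place; delete it only after the compressed copy is verified.
    }
}

The compressed copy is an ordinary HDFS file, so remembering the original path and restoring the data later (your questions 3 and 4) would be up to your own naming scheme or bookkeeping; HDFS itself will not keep that mapping for you.

Alternatively, you may consider reviewing HDFS Erasure Coding [1] if that suits your requirement: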
Erasure Coding in HDFS significantly reduces storage overhead while achieving similar or better fault tolerance through the use of parity cells (similar to RAID 5). Prior to the introduction of EC, HDFS used 3x replication exclusively for fault tolerance, meaning that a 1 GB file would use 3 GB of raw disk space. With EC, the same level of fault tolerance can be achieved using only 1.5 GB of raw disk space.
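On a Hadoop 3 based cluster (CDH 6 / CDP), an EC policy is applied per directory, so cold data can be moved under a directory that carries, for example, the default RS-6-3-1024k policy. The snippet below is only a sketch of the programmatic route (the hdfs ec command-line tool does the same thing); the directory path is a placeholder, and existing files have to be rewritten (for example with distcp) before they actually take up less space:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ApplyEcPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Erasure coding is HDFS-specific, so we need the DistributedFileSystem API.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Example policy and directory; RS-6-3-1024k is the default Reed-Solomon policy in Hadoop 3.
        String policy = "RS-6-3-1024k";
        Path coldData = new Path("/data/old");

        // Enable the policy cluster-wide (admin operation), then apply it to the cold-data directory.
        dfs.enableErasureCodingPolicy(policy);
        dfs.setErasureCodingPolicy(coldData, policy);

        // Only files written (or rewritten) after this point are stored erasure-coded.
        System.out.println("EC policy on " + coldData + ": " + dfs.getErasureCodingPolicy(coldData));
    }
}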
Please refer to the article below [1] for more insights on EC:

Ref[1]: https://blog.cloudera.com/hdfs-erasure-coding-in-production/

[2] https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-choosing-configuring-da...