Any problems with using multiple compression formats for different partitions of the same table?

I have a table that is the output of an entire data model. I load new data into it each day, and each day is a partition.

The partition column is 'load_dt', so load_dt='2016-01-01' holds a copy of the data as of the end of the day on 2016-01-01, load_dt='2016-01-02' holds a copy of the data model as of 2016-01-02, and so on. Each partition is stored in Parquet file format with Snappy compression.

After a certain amount of time I am far less likely to need fast access to the data, so I am considering reloading old partitions as text files with bzip2 compression so that they take up less space. The more recent partitions would stay Parquet/Snappy, while older partitions in the same table would be text/bzip2.
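
To make the idea concrete, here is a rough sketch of the archiving step in Python. It only picks out the load_dt values older than a retention window and renders the HiveQL that would copy one partition into a bzip2-compressed text table. The table names (model_output, model_output_archive), the 90-day window, and both helper functions are hypothetical; the sketch also assumes the archive table already exists with the same columns, is partitioned by load_dt, and is stored as TEXTFILE. It copies into a separate table and does not attempt to switch the format of a partition in place.

```python
from datetime import date, timedelta


def partitions_to_archive(load_dates, today, keep_days=90):
    """Return the load_dt values (ISO strings) older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in load_dates if date.fromisoformat(d) < cutoff)


def copy_statements(source_table, archive_table, load_dt):
    """Render HiveQL that copies one partition into a bzip2-compressed text table.

    Assumes archive_table already exists, has the same columns as source_table,
    is partitioned by load_dt, and is STORED AS TEXTFILE.
    """
    date.fromisoformat(load_dt)  # raises ValueError on a malformed date
    return [
        # Compress the files written by the INSERT with the bzip2 codec.
        "SET hive.exec.compress.output=true;",
        "SET mapreduce.output.fileoutputformat.compress.codec="
        "org.apache.hadoop.io.compress.BZip2Codec;",
        # SELECT * returns load_dt as the last column, so a dynamic
        # partition insert keeps the column counts aligned.
        "SET hive.exec.dynamic.partition.mode=nonstrict;",
        f"INSERT OVERWRITE TABLE {archive_table} PARTITION (load_dt) "
        f"SELECT * FROM {source_table} WHERE load_dt = '{load_dt}';",
    ]


if __name__ == "__main__":
    # Hypothetical partition list and run date, for illustration only.
    existing = ["2016-01-01", "2016-01-02", "2016-06-01"]
    for load_dt in partitions_to_archive(existing, today=date(2016, 6, 10)):
        for stmt in copy_statements("model_output", "model_output_archive", load_dt):
            print(stmt)
```

Whether the rewritten partition should then go back into the original table, or stay in a separate archive table, is exactly the part I am unsure about.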

 

Other than slower performance when I query the older, more highly compressed partitions, are there any other issues I should expect to run into if I try this? Or, if it's not a good idea to mix file formats and compression codecs in the same table, can someone suggest a best practice for archiving? Thanks!
