How to create a Hive table (textfile/sequencefile) with compression enabled?
When creating an ORC table I set orc.compress = SNAPPY. Is there a similar option for sequence files and text files, or do I just need to enable compression as below before the insert statement?
I set the above properties, created a table stored as sequencefile, and inserted data into that table. But when I checked the compression using "describe formatted tablename", I still see "Compressed" under the storage information as "No". Why is this? Is this the right way to check the table compression? And is it possible to store a textfile table in compressed format too?
Also, once I enable SNAPPY compression on an ORC table, I am able to insert data into it irrespective of whether I set the above three SET parameters or not. Even if I set compression to false, I can still insert data into the SNAPPY-compressed ORC table. So I am a little confused about the usage of the above three parameters. Can someone please explain what these three settings are for?
1. With the above parameters, you will get a compressed sequencefile. But since this is file-level compression, you won't see it in the Hive metadata (with describe formatted tablename).
2. To get compression with textfiles, you can put compressed files (like file1.gz and file2.gz) directly into the folder of an external table and Hive can use them. However, gzip files cannot be split, so if these files are big you will end up with 1 mapper per file.
3. When you use ORC, you don't need to explicitly set hive.exec.compress.output and the mapred compression settings, as the ORC SerDe takes care of compression based on the table properties.
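To make points 1 and 3 concrete, here is a minimal HiveQL sketch. The thread does not show the original three SET statements, so the ones below are an assumption (a commonly used combination; older setups use the deprecated names mapred.output.compression.codec and mapred.output.compression.type). The table and column names are hypothetical.

```sql
-- Assumed session settings; not necessarily the exact three from the question.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Hypothetical landing table: the files written by the insert below are
-- compressed because of the session settings, not because of the DDL.
CREATE TABLE landing_seq (id INT, payload STRING)
STORED AS SEQUENCEFILE;

INSERT INTO TABLE landing_seq
SELECT id, payload FROM source_table;  -- source_table is hypothetical

-- ORC: compression comes from the table property, so the SET
-- parameters above are not needed for this table.
CREATE TABLE final_orc (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```

Because the sequencefile compression is decided at write time by the session, the table definition itself carries no trace of it, which matches what describe formatted reports.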
Hi, Thanks for your reply.
So if I set the compression codec for Textfile/Sequencefile to gzip, bzip2, LZO or Snappy, will the files created by the insert statement be splittable, or will each one need to be read entirely on one node?
What about other file formats like Parquet, Avro and RCFile? Is whether a file can be split determined by the file format or by the compression method? It is the compression method, right?
Which compression methods make the file data non-splittable?
And what does "Compressed" actually mean in the output of describe formatted? Is that some table-level compression? How do I enable it?
If I understand the answers correctly, the compression method for a sequence file is determined by the parameters set before the insert statement. If so, what happens if I enable compression with Snappy before the first insert, switch to a different codec for the second insert, and disable compression before the third insert? Would this create three different file formats in the background? I tried this and I am able to select the data from the table, but I am not sure whether it creates three different file formats in the background. Any idea?
And if one of these compression methods is non-splittable, then only the mapper that reads that particular file has to read the entire file, right? The other mappers reading the other files would still read split data, right?
So ORC is completely on its own. ORC files follow their own logic and use compression internally, so the files themselves are not compressed as a whole. The parameters you used don't matter for them. They are intended for sequence files and delimited ones.
For Sequence files I think you did everything correctly:
For sequence files (and delimited files) Hive essentially just uses Hadoop input formats to read them. These natively support compressed files: the codec is recorded either in the sequence file header or in the filename of the delimited file (.gz will be unpacked with gzip, for example). So MapReduce/Tez knows whether the data is compressed and will just unpack it. I actually have no idea what "Compressed" means in the describe output. Perhaps someone else has an idea there.
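Since each sequence file records its codec in its own header, mixing codecs across inserts should leave you with one file format and differently compressed files, all readable by the same query. A rough sketch of that experiment, with hypothetical table names and the newer property name assumed:

```sql
-- Hypothetical table; every insert below writes sequence files.
CREATE TABLE IF NOT EXISTS landing_seq (id INT, payload STRING)
STORED AS SEQUENCEFILE;

-- First insert: Snappy-compressed files.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT INTO TABLE landing_seq SELECT id, payload FROM batch_1;

-- Second insert: gzip-compressed files.
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
INSERT INTO TABLE landing_seq SELECT id, payload FROM batch_2;

-- Third insert: uncompressed files.
SET hive.exec.compress.output=false;
INSERT INTO TABLE landing_seq SELECT id, payload FROM batch_3;

-- The reader picks the codec per file from the header, so one query covers all three.
SELECT COUNT(*) FROM landing_seq;
```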
However, my general tip: use ORC for any final table format. You do not need to worry about compression, just set the ORC parameters, and please use zip (ZLIB in orc.compress terms); it is slightly faster now than Snappy and compresses much better. Hive has been heavily optimized for it.
Delimited files should mainly be used for import/export, so normally you will either get the files already compressed or use a tool like Pig/Spark that compresses its output, which you then import into Hadoop.
Sequence files can be a good way of storing temp tables etc. because they are very fast to read/write. What you did should be correct. I would just check the folder size with hadoop fs -du /apps/hive/warehouse/mydb/mytable to see if it gets smaller when you enable compression. Not sure what the describe formatted output refers to; I don't think Hive should actually know whether the data is compressed without checking the data, but it might be worthwhile to check the code.
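If you prefer to stay inside the Hive CLI, the same size check can be sketched with the CLI's dfs command, which hands the arguments to the Hadoop file system shell. This assumes mytable is a scratch sequencefile table you can safely overwrite and that source_table (hypothetical) holds the test data; the codec property name is also an assumption.

```sql
USE mydb;

-- Baseline: write the data uncompressed and note the folder size.
SET hive.exec.compress.output=false;
INSERT OVERWRITE TABLE mytable SELECT * FROM source_table;
dfs -du -h /apps/hive/warehouse/mydb/mytable;

-- Same data with Snappy enabled; the folder should come out smaller.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE mytable SELECT * FROM source_table;
dfs -du -h /apps/hive/warehouse/mydb/mytable;
```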
Thanks for the suggestion. Actually, I am just using this sequence table as a place to land the data from an external source with little to no modification, and using ORC for the final table that will be queried.
But since the data in this sequence table will be retained for around 100 days, I wanted it compressed without compromising read performance. That's why I chose Snappy over the other codecs, thinking it might be faster to query.
For sequence files Snappy might be good. ORC is heavily optimized for zip now. But honestly I would just try it with one table and have a look at the sizes in the file system.
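One way to run that trial for the ORC side, as a sketch only: build two copies of the same data, one per codec, and compare their folders. The table names and warehouse paths are hypothetical (adjust the paths to wherever your tables actually live), and landing_seq stands in for your sequence table.

```sql
-- Two ORC copies of the same (hypothetical) source data, one per codec.
CREATE TABLE orc_trial_zlib
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB")
AS SELECT * FROM landing_seq;

CREATE TABLE orc_trial_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM landing_seq;

-- Compare on-disk sizes from the Hive CLI; paths are assumptions.
dfs -du -s -h /apps/hive/warehouse/mydb/orc_trial_zlib;
dfs -du -s -h /apps/hive/warehouse/mydb/orc_trial_snappy;
```

Running the same representative query against both tables would then give a feel for the read-speed side of the trade-off.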