Note: this article deals only with the disk space used by each format, not with query performance.
When using Hive as your engine for SQL queries, you might want to consider using ORC or Parquet file formats for your data. There are numerous advantages to consider when choosing ORC or Parquet.
Firstly, both will give you columnar compression of the data within, whereas a plain text file has no compression at all. Secondly, indexes within ORC or Parquet help with query speed, as some basic statistics are stored inside the files, such as minimum and maximum values and row counts. Both formats can also express complex data structures (such as hierarchies), which a plain CSV file cannot. The combination of these can speed up your queries many times over. Have a look at the article here to read more about the underlying mechanics of both formats: https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
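To illustrate the complex-type point, here is a minimal, hypothetical HiveQL sketch (the table and column names are made up and are not part of the flight arrivals data set used below):

-- Nested types that ORC/Parquet can store natively, but a flat CSV cannot express
CREATE TABLE passenger_profiles (
  id            BIGINT,
  name          STRING,
  home_airports ARRAY<STRING>,                       -- a repeated field
  loyalty       STRUCT<program:STRING, tier:STRING>  -- a hierarchy
)
STORED AS ORC;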
However, how much space on disk do these formats use in Hive? Saving on disk space (and, indirectly, IO) is always a good thing, but it can be hard to calculate exactly how much space you will be using with compression. Obviously, every file and data set is different, and the data inside will always be the determining factor for the compression ratio you get. Text will compress better than binary data. Repeating values and strings will compress better than purely random data, and so forth.
Convert the table to ORC and Parquet, using the default settings for both formats:
CREATE EXTERNAL TABLE flight_arrivals_external_orc STORED AS ORC AS SELECT * FROM flight_arrivals;
CREATE EXTERNAL TABLE flight_arrivals_external_parquet STORED AS PARQUET AS SELECT * FROM flight_arrivals;
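If you want to double-check which storage format Hive actually used for each table, DESCRIBE FORMATTED lists the SerDe and the input/output formats:

DESCRIBE FORMATTED flight_arrivals_external_orc;
DESCRIBE FORMATTED flight_arrivals_external_parquet;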
Let's have a look at the disk usage for both (output of hdfs dfs -du -h against the warehouse directories):
93.5 M /warehouse/tablespace/external/hive/flight_arrivals_external_orc
146.6 M /warehouse/tablespace/external/hive/flight_arrivals_external_parquet
In summary:

Format     | Size     | Size as % of CSV
-----------|----------|-----------------
bz2        | 108.5 MB | 16.5%
CSV (Text) | 657.5 MB | -
ORC        | 93.5 MB  | 14.2%
Parquet    | 146.6 MB | 22.3%
One should keep in mind that default settings and values were used to create the ORC and Parquet tables, and that no further optimizations were applied to either format. Both ORC and Parquet ship with many options and optimizations for compressing your data; only the defaults that ship with HDP 3.1 were used here. Consider taking a sample or subset of your data set and experimenting with the various compression algorithms and options for each format, to find the one that suits your use case best.
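As a starting point for such experiments, the compression codec can be switched per table through TBLPROPERTIES. A minimal sketch, assuming the same flight_arrivals source table (the codec choices here are illustrative, not recommendations):

-- ORC with Snappy instead of the default ZLIB codec
CREATE EXTERNAL TABLE flight_arrivals_orc_snappy
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM flight_arrivals;

-- Parquet with an explicitly chosen gzip codec
CREATE EXTERNAL TABLE flight_arrivals_parquet_gzip
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='GZIP')
AS SELECT * FROM flight_arrivals;

After rebuilding each table, the same hdfs dfs -du -h check as above shows how the codec choice affects the footprint on disk.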