Created on 01-16-2019 01:00 PM
Note: this article only deals with the disk space used by each format, not with query performance.
When using Hive as your engine for SQL queries, you might want to consider using ORC or Parquet file formats for your data. There are numerous advantages to consider when choosing ORC or Parquet. Firstly, both give you columnar compression of the data within, whereas a plain text file has no compression at all. Secondly, both formats store lightweight indexes and basic statistics inside the files, such as min/max values and row counts, which help with query speed. Both formats can also express complex data structures (such as hierarchies), which a plain CSV file cannot. The combination of these can speed up your queries many times over. Have a look at this article to read more about the underlying mechanics of both formats: https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
More information on the Parquet file format: https://parquet.apache.org/documentation/latest/
However, how much disk space do these formats actually use in Hive? Saving disk space (and, indirectly, I/O) is always a good thing, but it can be hard to calculate exactly how much space you will save with compression. Obviously, every file and data set is different, and the data inside will always be a determining factor for the kind of compression ratio you get: text compresses better than binary data, repeating values and strings compress better than purely random data, and so forth.
As a simple test, I took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html
The compressed bz2 download comes in at 108.5 MB, and uncompressed at 657.5 MB.
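The upload itself might look like the following (a sketch; the local filename and the HDFS directory are assumptions chosen to match the table location used in this article):

```shell
# Decompress the bz2 download locally, then copy the CSV into HDFS
bunzip2 2008.csv.bz2
hdfs dfs -mkdir -p /flight_arrivals
hdfs dfs -put 2008.csv /flight_arrivals/
```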
I then uploaded the data to HDFS, and created an external table on top of the uncompressed data set:
Create external table flight_arrivals (
  year int,
  month int,
  DayofMonth int,
  DayOfWeek int,
  DepTime int,
  CRSDepTime int,
  ArrTime int,
  CRSArrTime int,
  UniqueCarrier string,
  FlightNum int,
  TailNum string,
  ActualElapsedTime int,
  CRSElapsedTime int,
  AirTime int,
  ArrDelay int,
  DepDelay int,
  Origin string,
  Dest string,
  Distance int,
  TaxiIn int,
  TaxiOut int,
  Cancelled int,
  CancellationCode string,  -- cancellation codes are letters (A-D), so string rather than int
  Diverted int,
  CarrierDelay string,
  WeatherDelay string,
  NASDelay string,
  SecurityDelay string,
  LateAircraftDelay string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/flight_arrivals';
Counting the rows:
select count(*) from flight_arrivals;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
1 row selected (3.341 seconds)
Converting to an ORC and Parquet table, using default settings for both formats:
Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals;

Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals;
Let's have a look at the disk usage for both:
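The numbers below can be obtained with `hdfs dfs -du` (a sketch; the warehouse paths assume the default external tablespace location in HDP 3.1):

```shell
# Summarize the on-disk size of each table's directory, human-readable
hdfs dfs -du -s -h /warehouse/tablespace/external/hive/flight_arrivals_external_orc
hdfs dfs -du -s -h /warehouse/tablespace/external/hive/flight_arrivals_external_parquet
```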
93.5 M   /warehouse/tablespace/external/hive/flight_arrivals_external_orc
146.6 M  /warehouse/tablespace/external/hive/flight_arrivals_external_parquet
| Format     | Size     | % of CSV size |
|------------|----------|---------------|
| CSV (Text) | 657.5 MB | -             |
| ORC        | 93.5 MB  | 14.2%         |
| Parquet    | 146.6 MB | 22.3%         |
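The relative sizes work out as follows (a quick check of the arithmetic, using the measured sizes from above):

```python
# Express each format's on-disk size as a percentage of the uncompressed CSV
sizes_mb = {"CSV (Text)": 657.5, "ORC": 93.5, "Parquet": 146.6}
baseline = sizes_mb["CSV (Text)"]
for fmt, size in sizes_mb.items():
    print(f"{fmt}: {size} MB -> {size / baseline:.1%} of the original")
```

In other words, with nothing but the defaults, ORC shrinks this data set to roughly a seventh of its original size, and Parquet to roughly a fifth.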
Keep in mind that the ORC and Parquet tables were created with default settings and values, and no other optimizations were applied to either format. Both ORC and Parquet ship with many options and optimizations for compressing your data; only the defaults that ship with HDP 3.1 were used here. Consider taking a sample or subset of your data set and experimenting with the various compression algorithms and options for each format to find the one that suits your use case best.
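For example, the compression codec can be changed per table via table properties (a sketch; `orc.compress` and `parquet.compression` are the standard Hive table properties, and the codec choices shown are only illustrative):

```sql
-- ORC with Snappy compression instead of the default ZLIB
Create external table flight_arrivals_orc_snappy
stored as ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
as select * from flight_arrivals;

-- Parquet with Gzip compression
Create external table flight_arrivals_parquet_gzip
stored as Parquet
TBLPROPERTIES ("parquet.compression"="GZIP")
as select * from flight_arrivals;
```

Re-running the `hdfs dfs -du` check against such variants makes it easy to compare codecs on your own data.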