Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Contributor

Note: this article only deals with the disk space of each format, not the performance comparison.

When using Hive as your engine for SQL queries, you might want to consider using ORC or Parquet file formats for your data. There are numerous advantages to consider when choosing ORC or Parquet. Firstly, both will give you columnar compression of the data within, whereas a plain text file will have no compression at all. Secondly, indexes within ORC or Parquet will help with query speed as some basic statistics are stored inside the files, such as min,max value, number of rows etc. Both formats can also express complex data structures (such as hierarchies) which a plain CSV file cannot do. The combination of these can boost your query many times over. Have a look at the article here to read more about the underlying mechanics of both formats: https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

However, what is the space on disk that is used for these formats in Hive ? Saving on disk space (and inadvertently, IO), is always a good thing, but it can be hard to calculate exactly how much space you will be using with compression. Obviously, every file and data set is different, and the data inside will always be a determining factor for what type of compression you'll get. Text will compress better than binary data. Repeating values and strings will compress better than pure random data, and so forth.

As a simple test, I took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html

The compressed bz2 download measures at 108.5 Mb, and uncompressed at 657.5 Mb

I then uploaded the data to HDFS, and created an external table on top of the uncompressed data set:

Create external table flight_arrivals (
year               int,
month              int,
DayofMonth         int,
DayOfWeek          int,
DepTime            int,
CRSDepTime         int,
ArrTime            int,
CRSArrTime         int,
UniqueCarrier      string,
FlightNum          int,
TailNum            string,
ActualElapsedTime  int,
CRSElapsedTime     int,
AirTime            int,
ArrDelay           int,
DepDelay           int,
Origin             string,
Dest               string,
Distance           int,
TaxiIn             int,
TaxiOut            int,
Cancelled          int,
CancellationCode   int,
Diverted           int,
CarrierDelay       string,
WeatherDelay       string,
NASDelay           string,
SecurityDelay      string,
LateAircraftDelay  string
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/flight_arrivals';

Counting the rows:

select count(*) from flight_arrivals;
+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
1 row selected (3.341 seconds)

Converting to an ORC and Parquet table, using default settings for both formats:

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals;

Let's have a look at the disk usage for both:

93.5 M   /warehouse/tablespace/external/hive/flight_arrivals_external_orc
146.6 M  /warehouse/tablespace/external/hive/flight_arrivals_external_parquet

In Summary:

FormatSizeCompressed %
bz2108.5 Mb16.5%
CSV (Text)657.5 Mb-
ORC93.5 Mb14.2%
Parquet146.6 Mb22.3%

One should keep in mind that default settings and values were used to create the ORC and Parquet tables, as well as no other optimizations were used for either formats. Both ORC and Parquet ships with many options & optimizations to compress your data, only the defaults which ships with HDP 3.1 were used. Consider using a sample or subset of your data set, and play with the various compression algorithms and options for each to find the one that suits your use case best.

11,163 Views