Created 06-06-2016 08:48 PM
If I am only concerned with the performance of HiveQL queries and not about storage space, is it better to create my tables with no compression?
http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
What if my data size is a few hundred gigs?
Terabytes?
A petabyte?
Same answer?
Let's assume sane queries returning a few thousand rows, a dozen columns or so, a few items in a where clause and an order by. Interactive queries for Tableau.
Created 06-06-2016 09:03 PM
Compression can improve the performance of Hive queries by decreasing the amount of data that needs to be read from disk (reading compressed vs. uncompressed). Just because your query returns only a small number of rows doesn't mean that processing it won't still read a lot of data. There is an inflection point, typically with small data sets, where reading the uncompressed data costs less than decompressing it, and there you might want to skip compression. However, when larger amounts of data need to be read to fulfill a query, compressing the data can provide performance gains.
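For reference, a minimal sketch of how a compression codec is chosen per ORC table; the table and column names here are hypothetical:

-- Hypothetical table; "orc.compress" accepts NONE, SNAPPY or ZLIB
CREATE TABLE clicks_orc (
  user_id BIGINT,
  url STRING,
  event_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");  -- "NONE" would create the table uncompressed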
Created 06-07-2016 06:15 AM
To add to @emaxwell
> If I am only concerned with performance
The bigger win is being able to skip decompressing blocks entirely - if you have hive.optimize.index.filter=true, that will kick in.
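A one-line sketch of turning that on for a session (the property name is standard Hive; how much it helps depends on the ORC indexes discussed below):

-- Use ORC indexes/bloom filters to skip row groups, so skipped blocks are never decompressed
SET hive.optimize.index.filter=true;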
> a few items in a where clause
That's where the ORC indexes matter - if you have orc.create.index=true and orc.bloom.filter.columns contains those columns specifically (using "*" is easy, but it slows down ETL when tables are wide and the measures are random).
Clustering and sorting on the most common filter column can give you 2-3 orders of magnitude of performance (sorting specifically, because the min/max values are stored in the file and stripe footers - this happens naturally in most ETL for date/timestamp columns, but for something randomly distributed like a location id it is a big win).
See for example, this ETL script https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78
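A condensed sketch of the same idea, assuming a hypothetical flights table that is mostly filtered on location_id (names and bucket count are illustrative, not taken from the script above):

-- Bucket and sort on the filter column so min/max stats and bloom filters stay selective,
-- letting the reader skip whole stripes/row groups
CREATE TABLE flights_orc (
  flight_date DATE,
  location_id INT,
  carrier STRING,
  delay_minutes INT
)
CLUSTERED BY (location_id) SORTED BY (location_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.create.index" = "true",
               "orc.bloom.filter.columns" = "location_id");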
Created 06-07-2016 12:13 PM
I did a couple of test queries on a 15 GB data set once; uncompressed was the slowest, and everything else was more or less identical. I tested Snappy and Zlib with 64 MB and 256 MB stripes.
However, the setup was heavily tuned for short-running queries on a single column, so the majority of the time was not spent in execution.
You might see slightly different results on a heavily CPU-restricted system. However, as Gopal said, proper sorting and distribution is much more important.
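For context, a sketch of how such a comparison can be set up; the table names are hypothetical and the stripe size is specified in bytes:

-- Hypothetical copy of the data with Zlib compression and 256 MB stripes
CREATE TABLE events_zlib_256 STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB",
               "orc.stripe.size" = "268435456")  -- 256 MB
AS SELECT * FROM events_raw;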