
ORC with Zlib vs ORC with No Compression

Master Guru

If I am only concerned with the performance of HiveQL queries, and not with storage space, is it better to create my tables with no compression?

http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy

What if my data size is a few hundred gigabytes?

Terabytes?

A petabyte?

Same answer?

Let's assume sane queries returning a few thousand rows, a dozen columns or so, a few items in a WHERE clause, and an ORDER BY - interactive queries for Tableau.

1 ACCEPTED SOLUTION


@Timothy Spann

Compression can improve the performance of Hive queries by decreasing the amount of data that has to be read from disk. Just because a query returns only a small number of rows doesn't mean that processing it won't read a lot of data. For small data sets there is an inflection point where reading the uncompressed data costs less than decompressing it, and there you might want to skip compression. When larger amounts of data need to be read to fulfill a query, however, compressing the data can provide performance gains.
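Compression is chosen per table through ORC table properties. A minimal sketch of the two variants being compared here; the table names (flights_zlib, flights_none, flights_staging) are made up for illustration:

    -- ORC with Zlib compression (the default codec for ORC)
    CREATE TABLE flights_zlib STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB')
    AS SELECT * FROM flights_staging;

    -- Same data, ORC with no compression
    CREATE TABLE flights_none STORED AS ORC
    TBLPROPERTIES ('orc.compress'='NONE')
    AS SELECT * FROM flights_staging;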


3 REPLIES

Expert Contributor

To add to @emaxwell's answer:

> If I am only concerned with performance

The bigger win is being able to skip decompressing blocks entirely; if you have hive.optimize.index.filter=true, that will kick in.
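That is a session (or hive-site.xml) setting:

    -- Enable ORC predicate pushdown, so row groups whose min/max or
    -- bloom filter stats cannot match the WHERE clause are skipped
    -- before any decompression happens.
    SET hive.optimize.index.filter=true;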

> a few items in a where clause

That's where the ORC indexes matter: set orc.create.index=true and list those columns specifically in orc.bloom.filter.columns (using "*" is easy, but it slows down ETL when tables are wider and the measures are random).
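A sketch of such a DDL, with made-up table and column names; the bloom filter is built only on the column the WHERE clauses actually hit:

    -- Hypothetical table; bloom filter only on the filtered column
    CREATE TABLE events (
      event_time TIMESTAMP,
      location_id BIGINT,
      payload STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.create.index'='true',
      'orc.compress'='ZLIB',
      'orc.bloom.filter.columns'='location_id'
    );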

Clustering & Sorting on the most common column in the filter there can give you 2-3 magnitudes of performance (sorting specifically, because the min/max are stored at the footer of a file - this naturally happens for most ETL for date/timestamp columns, but for something randomly distributed like a location id, that is a big win).

See, for example, this ETL script: https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78

Master Guru

I did a couple of test queries on a 15GB data set once; uncompressed was the slowest, and everything else was more or less identical. I tested Snappy and Zlib in 64MB and 256MB stripes.
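Both knobs are table properties. A sketch of one of those combinations, again with made-up table names; orc.stripe.size is in bytes, so 64MB is 67108864:

    -- Snappy compression with 64MB stripes
    CREATE TABLE flights_snappy_64 STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'='SNAPPY',
      'orc.stripe.size'='67108864'
    )
    AS SELECT * FROM flights_staging;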

However, it was heavily tuned for short-running queries on a single column, so the majority of the time was not spent in execution.

You might see slightly different results in a heavily CPU-restricted system. However, as Gopal said, proper sorting and distribution are much more important.