
BLOB with ORC in Hive


This is the access pattern:

SELECT datetime FROM tableName WHERE id = ?

SELECT content FROM tableName WHERE id = ? AND datetime = ?

One of the columns is a BLOB, and the table is stored in ORC format (the table needs to be transactional).
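For concreteness, a minimal sketch of what such a table might look like — all names are assumed, and the BINARY column stands in for the BLOB (Hive's closest native type). Note that on HDP 2.3 (Hive 1.2), transactional tables must be bucketed:

```sql
-- Hypothetical sketch of the table described above; names and bucket count assumed.
CREATE TABLE tableName (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY                      -- the BLOB column
)
CLUSTERED BY (id) INTO 16 BUCKETS     -- bucketing is required for ACID tables
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```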

Because of the BLOB, the read times are high.

Any recommendations on optimization? This is on HDP 2.3.

Thanks,

Kiran

1 ACCEPTED SOLUTION


Here's the recommendation from a Hive SME:

You should start by checking off the typical recommendations in the Hive performance tuning guide: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.5/bk_hive-performance-tuning/bk_hive-performa...

Pay particular attention to partitioning: depending on how you access your datetime field, you may not benefit at all from partition pruning.

A safe / proven path is to partition by date and use either an explicit partition key filter or a dimension lookup that allows Hive to infer partition keys from the datetime field.
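A sketch of that approach, with an explicit partition-key filter the application supplies alongside the datetime (table and column names are assumed):

```sql
-- Sketch: partition by date so the datetime lookup can prune partitions.
CREATE TABLE tableName_part (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY
)
PARTITIONED BY (dt DATE)              -- derived date partition column
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- An explicit partition-key filter lets Hive read only one partition;
-- the caller passes dt = the date portion of the datetime it already has.
SELECT content
FROM tableName_part
WHERE id = ? AND datetime = ? AND dt = ?;
```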

I don't recall seeing any other BLOB-specific tuning techniques. Ideally you would lazy-load the BLOB only if the ID matches, but I don't believe there is a way to control that.

One way to get closer to that is to have the ID / datetime mapping in a separate table without the BLOBs. Populating the list of datetimes (query 1) would be faster that way.
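A sketch of that split — a narrow index table carrying only the id/datetime mapping, so query 1 never touches the BLOB files (names assumed):

```sql
-- Narrow table with no BLOB column; query 1 scans only this.
CREATE TABLE tableName_index (
  id BIGINT,
  datetime TIMESTAMP
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Query 1 against the narrow table:
SELECT datetime FROM tableName_index WHERE id = ?;

-- Query 2 still goes to the wide table, now with both keys known:
SELECT content FROM tableName WHERE id = ? AND datetime = ?;
```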

Other thoughts:

You should try Hive 2 (in HDP: enable LLAP), which has a bucket-pruning optimization; if you cluster the table by ID, it would scan fewer files. I see you are on 2.3, but this could be an incentive to move.
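On Hive 2 with Tez, bucket pruning is controlled by `hive.tez.bucket.pruning`; with the table clustered by id, a point lookup can then skip every bucket whose hash doesn't match. A sketch (table name assumed):

```sql
-- Hive 2 / Tez only: let the point lookup on the bucketing column
-- skip non-matching buckets instead of scanning all files.
SET hive.tez.bucket.pruning=true;

SELECT datetime FROM tableName WHERE id = ?;
```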

You may try experimenting with ORC stripe sizes.
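Stripe size and row-index stride can be set per table via TBLPROPERTIES; smaller values can reduce how much data a point lookup has to read, at the cost of more overhead overall. The values below are purely illustrative, not recommendations:

```sql
-- Sketch: experiment with smaller ORC stripes / finer row index
-- (values illustrative; measure against your own workload).
CREATE TABLE tableName_tuned (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'transactional'='true',
  'orc.stripe.size'='8388608',       -- 8 MB stripes
  'orc.row.index.stride'='1000'      -- finer-grained index for predicate pushdown
);
```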

You might try compressing the BLOBs to speed the search for a specific ID (if it is a point lookup); the application would then need to decompress them.

Long story short: only the two pruning options above are system-level optimizations. Beyond that, you are probably looking at handling this at the application layer.

