
BLOB with ORC in Hive


This is the access pattern:

SELECT datetime FROM tableName WHERE id = ?

SELECT content FROM tableName WHERE id = ? AND datetime = ?

One of the columns is a BLOB, and the table is stored in ORC format (the table needs to be transactional).
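For concreteness, a minimal sketch of what such a table might look like — all names are assumed, and the BINARY column stands in for the BLOB (Hive's closest native type). Note that on HDP 2.3 (Hive 1.2), transactional tables must be bucketed:

```sql
-- Hypothetical sketch of the table described above; names and bucket count assumed.
CREATE TABLE tableName (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY                      -- the BLOB column
)
CLUSTERED BY (id) INTO 16 BUCKETS     -- bucketing is required for ACID tables
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```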

Because of the BLOB, the read times are high.

Any recommendations on optimization? This is on HDP 2.3.

Thanks,

Kiran

1 ACCEPTED SOLUTION


Here's the recommendation from a Hive SME:

You should start by checking off the typical recommendations in the Hive performance tuning guide: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.5/bk_hive-performance-tuning/bk_hive-performa...

Pay particular attention to partitioning: depending on how you access your datetime field, you may not benefit at all from partition pruning.

A safe / proven path is to partition by date and use either an explicit partition key filter or a dimension lookup that allows Hive to infer partition keys from the datetime field.
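A sketch of that approach, with an explicit partition-key filter the application supplies alongside the datetime (table and column names are assumed):

```sql
-- Sketch: partition by date so the datetime lookup can prune partitions.
CREATE TABLE tableName_part (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY
)
PARTITIONED BY (dt DATE)              -- derived date partition column
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- An explicit partition-key filter lets Hive read only one partition;
-- the caller passes dt = the date portion of the datetime it already has.
SELECT content
FROM tableName_part
WHERE id = ? AND datetime = ? AND dt = ?;
```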

I don't recall seeing any other BLOB-specific tuning techniques. Ideally you would lazy-load the BLOB only if the ID matches, but I don't believe there is a way to control that.

One way to get closer to that is to have the ID / datetime mapping in a separate table without the BLOBs. Populating the list of datetimes (query 1) would be faster that way.
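A sketch of that split — a narrow index table carrying only the id/datetime mapping, so query 1 never touches the BLOB files (names assumed):

```sql
-- Narrow table with no BLOB column; query 1 scans only this.
CREATE TABLE tableName_index (
  id BIGINT,
  datetime TIMESTAMP
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Query 1 against the narrow table:
SELECT datetime FROM tableName_index WHERE id = ?;

-- Query 2 still goes to the wide table, now with both keys known:
SELECT content FROM tableName WHERE id = ? AND datetime = ?;
```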

Other thoughts:

You should try Hive 2 (in HDP: enable LLAP), which has a bucket-pruning optimization; if you cluster the table by ID, it would scan fewer files. I see you are on 2.3, but this could be an incentive to move.
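On Hive 2 with Tez, bucket pruning is controlled by `hive.tez.bucket.pruning`; with the table clustered by id, a point lookup can then skip every bucket whose hash doesn't match. A sketch (table name assumed):

```sql
-- Hive 2 / Tez only: let the point lookup on the bucketing column
-- skip non-matching buckets instead of scanning all files.
SET hive.tez.bucket.pruning=true;

SELECT datetime FROM tableName WHERE id = ?;
```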

You may try experimenting with ORC stripe sizes.
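Stripe size and row-index stride can be set per table via TBLPROPERTIES; smaller values can reduce how much data a point lookup has to read, at the cost of more overhead overall. The values below are purely illustrative, not recommendations:

```sql
-- Sketch: experiment with smaller ORC stripes / finer row index
-- (values illustrative; measure against your own workload).
CREATE TABLE tableName_tuned (
  id BIGINT,
  datetime TIMESTAMP,
  content BINARY
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'transactional'='true',
  'orc.stripe.size'='8388608',       -- 8 MB stripes
  'orc.row.index.stride'='1000'      -- finer-grained index for predicate pushdown
);
```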

You might try compressing the BLOBs to speed the search for a specific ID (if it is a point lookup); the application would then need to decompress them.

Long story short: only the two pruning options above are system-level optimizations. Beyond that, you are probably looking at handling this at the application layer.

