It appears from this JIRA that hive support for bucket pruning is available from version 2.0.0. Looks like this only possible with Tez by setting the property hive.tez.bucket.pruning.
What are the options to achieve the performance improvement like bucket pruning till HDP is available with Hive 2.0.0?
like I wrote in the other topic. Bucket pruning is not yet available so you can use two ways to reduce the amount of data that needs to be processed.
- Partitioning ( and partition pruning during joins). The limitation is that too many partitions are hard on the hive server and the metastore so you shouldn't have more than a couple thousand partitions. But they are still the most efficient way to reduce data read during queries and work also across joins.
- Predicate Pushdown in ORC.
Added some infomation on how to optimize that in the document below. In general this is applied in the Map task before data is read from HDFS. The map task opens the ORC index and skips all stripes and blocks that cannot contain relevant data based on min/max values (always on ) and bloom filters ( if configured). To do this efficiently data needs to be sorted ( happens more or less automatically for dates during continuous loads ) or you need to sort data during insert. You can use bucketing as well to "sort" data. However this is a double edged sword because you essentially put all relevant data in one bucket or file that file will take long to process. I have used bucketing + sorting in the past to essentially get predicate pushdown on two columns ( customer id and date ) because queries were restricting on one or the other dimension.
Hey @Benjamin Leonhardi,
I run Hive 0.13.1.
I have created ORC tables which surely give me big savings in space. But till now, I am not able to find any reduction in read speeds (vs sequence files). If I create a ORC table A of size say 30GB and then run a select query on one column, the maprfs read I see is almost same as 30GB, which indicates me no pruning was done before mapper.
I am not able to think of any reasons as of now why Predicate Pushdown is not getting into action. Maybe my Hive and Hadoop version do not support it. Can you help me.
Hello @Benjamin Leonhardi
I really like your slide you mentioned above. Just great. (http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data)
So,, I have additional question.
In short, for now , If we use Transaction ORC table, It is hard to optimizing the query because transactional table could not come with sort. right?
could you please check out my email thread to hive user group?
it is about bucketing and tablesample..
Thanks a lot. And you are correct you have limited control over a transaction table because data will stream in as it comes. However you get pretty much automatically a sort by date or any continuously increasing number ( any transation key for example )
Regarding your question:
The problem is that hadoop is parallel. So bucketing by your where condition is actually bad sometimes. It means a lot of mappers will be empty and close immediately but one has all the values and will take a long time. And the longest task defines the query length. So it can be better to bucket by something else and sort by the where condition. But that depends. If your cluster is full and you want high throughput of many queries at the same time the story is different. I didn't look too much into it but just as a general point. You normally have to have a deeper look in questions like this