Support Questions

Find answers, ask questions, and share your expertise

Counting rows in multiple partitions in Hive query

avatar
Rising Star

For a partitioned Hive table (stored as ORC), I can count the rows in a partition very quickly with a query like this, presumably because Hive gets the count directly from table statistics:

select count(*) from db.table where partition_date = '12-01-2015'

How can I just as quickly get counts from multiple partitions? A query like this launches a full tez job and takes a couple dozen seconds to run depending on the date range I choose:

select partition_date, count(*) from db.table where partition_date >= '11-01-2015' group by partition_date

Thanks!

I am running Hive 0.14 if that is relevant.

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Some few months ago I asked a similar question and I got that reply:

https://issues.apache.org/jira/browse/HIVE-11937

So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.

A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.

View solution in original post

1 REPLY 1

avatar
Super Collaborator

Some few months ago I asked a similar question and I got that reply:

https://issues.apache.org/jira/browse/HIVE-11937

So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.

A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.