Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Counting rows in multiple partitions in Hive query

avatar
Rising Star

For a partitioned Hive table (stored as ORC), I can count the rows in a partition very quickly with a query like this, presumably because Hive gets the count directly from table statistics:

select count(*) from db.table where partition_date = '12-01-2015'

How can I just as quickly get counts from multiple partitions? A query like this launches a full tez job and takes a couple dozen seconds to run depending on the date range I choose:

select partition_date, count(*) from db.table where partition_date >= '11-01-2015' group by partition_date

Thanks!

I am running Hive 0.14 if that is relevant.

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Some few months ago I asked a similar question and I got that reply:

https://issues.apache.org/jira/browse/HIVE-11937

So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.

A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.

View solution in original post

1 REPLY 1

avatar
Super Collaborator

Some few months ago I asked a similar question and I got that reply:

https://issues.apache.org/jira/browse/HIVE-11937

So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.

A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.