Counting rows in multiple partitions in Hive query

Aaron_Dossett — Tue, 01 Dec 2015 22:14:25 GMT

For a partitioned Hive table (stored as ORC), I can count the rows in a partition very quickly with a query like this, presumably because Hive gets the count directly from table statistics:

select count(*) from db.table where partition_date = '12-01-2015'

How can I just as quickly get counts from multiple partitions? A query like this launches a full tez job and takes a couple dozen seconds to run depending on the date range I choose:

select partition_date, count(*) from db.table where partition_date >= '11-01-2015' group by partition_date

Thanks!

I am running Hive 0.14 if that is relevant.

Re: Counting rows in multiple partitions in Hive query

sluangsay — Tue, 01 Dec 2015 22:34:08 GMT

Some few months ago I asked a similar question and I got that reply:

https://issues.apache.org/jira/browse/HIVE-11937

So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.

A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.

question Re: Counting rows in multiple partitions in Hive query in Archives of Support Questions (Read Only)

Counting rows in multiple partitions in Hive query

Re: Counting rows in multiple partitions in Hive query