- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Counting rows in multiple partitions in Hive query
- Labels:
-
Apache Hive
-
Apache Tez
Created ‎12-01-2015 02:14 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For a partitioned Hive table (stored as ORC), I can count the rows in a partition very quickly with a query like this, presumably because Hive gets the count directly from table statistics:
select count(*) from db.table where partition_date = '12-01-2015'
How can I just as quickly get counts from multiple partitions? A query like this launches a full tez job and takes a couple dozen seconds to run depending on the date range I choose:
select partition_date, count(*) from db.table where partition_date >= '11-01-2015' group by partition_date
Thanks!
I am running Hive 0.14 if that is relevant.
Created ‎12-01-2015 02:34 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Some few months ago I asked a similar question and I got that reply:
https://issues.apache.org/jira/browse/HIVE-11937
So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.
A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.
Created ‎12-01-2015 02:34 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Some few months ago I asked a similar question and I got that reply:
https://issues.apache.org/jira/browse/HIVE-11937
So, I don't think you can use the stats in Hive 0.14 for the kind of query you want to do. Maybe with the next Hive version.
A possible workaround would be to get the names of all your partitions in that table, and to have a script (in python, bash or a java program) that generates a query for each partition. Not sure it works but you might give it a try.
