Support Questions
Find answers, ask questions, and share your expertise

hive not using partition on group by

New Contributor

A weird behaviour I never saw.

A have parquet files in my HDFS generated by Spark, split into 3 partitions.

Here is the count for all partitions. It was calculated by Spark.

<code>+----------+--------+
|created_at|count(1)|
+----------+--------+
|  20190101|12774895|
|  20181231|18648432|
|  20190102|30010065|
+----------+--------+

Total: 61433392 records

Hive

Created a table.

<code>CREATE EXTERNAL TABLE raw.event (
    account_type STRING,
    event_name STRING,
    payload MAP<STRING, STRING>
)
PARTITIONED BY(created_at INT, product_name STRING, event_type STRING)
STORED AS PARQUET LOCATION '/datalake/raw/event'
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY');

Added all partitions.

show partitions raw.event;

<code>+---------------------------------------------------------------------+--+
|                              partition                              |
+---------------------------------------------------------------------+--+
| created_at=20181231/product_name=A/event_type=X  |
| created_at=20190101/product_name=A/event_type=X  |
| created_at=20190102/product_name=A/event_type=X  |
+---------------------------------------------------------------------+--+

Execute a full count to make sure everything run smoothly:

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event;

<code>+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Expected result! Ok 🙂

Now count just from one partition.

0: jdbc:hive2://headnodehost:10001/> select count(*) from raw.event where created_at=20190102 and product_name='A' and event_type='X';

Instead get the expected 12774895 count rows I am still getting the full count.

<code>+-----------+--+
|    _c0    |
+-----------+--+
| 61433392  |
+-----------+--+

Now I am trying to count and group by the created_at partition.

0: jdbc:hive2://headnodehost:10001/> SELECT created_at, count(*) FROM raw.event group by created_at;

<code>+-------------+-----------+--+
| created_at  |    _c1    |
+-------------+-----------+--+
| 20190102    | 61433392  |
+-------------+-----------+--+

I am still getting the full count of rows, and just the last added partition.

I am pretty sure the parquet files contains different created_at values.

Any suggestion?

0 REPLIES 0