Created 03-31-2021 01:17 AM
Hi,
on Hive 3.1.4 we have a COUNT(*) query which returns more than one rows, instead of exactly one.
I created this test table:
CREATE TABLE test_case (
cod_pers STRING,
cod_address STRING,
PRIMARY KEY (cod_pers) DISABLE NOVALIDATE )
PARTITIONED BY (num_snapshot BIGINT) ;
and populated it with more than 500,000 rows in 3 partitions.
To find how many people changed address, I wrote this query:
WITH data1 AS ( -- eliminate duplicates
select t.cod_pers, t.cod_address, count(*) as num_address
from test_case as t
group by t.cod_pers, t.cod_address
), data2 AS ( -- find changes per person
select s.cod_pers, count(*) as num_changes
from data1 s
group by s.cod_pers
having count(*)>1 )
select count(*) as num_all_changes from data2 as gg ;
Instead of obtaining a single row with the total number, the query returns 4 rows :
+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (1.252 seconds)
4 is also the number of reducers used by the query.
If I add a "LIMIT 1" clause, the query works as expected, and a final reducer is added in the query execution. The same happens adding "GROUP BY 1" clause.
I tried to COMPACT the table and have the statistics recalculated, but nothing changed.
I know I can rewrite the query, but I'm not looking for a work-aroud: I wonder if it's a known bug of CBO and if someone else experimented the same behavior (and also if a patch is available).
Thanks in advance
Created 03-31-2021 02:08 AM
Can you please try the below command:
select count(*) from data2;
It should return one single row .
Created 03-31-2021 02:18 AM
Hi,
the simple count(*) on the base table is working correctly:
select count(*) from be_prd_prt.test_case;
+---------+
| _c0 |
+---------+
| 537586 |
+---------+
1 row selected (0.2 seconds)
I have troubles with the query I wrote, because it returns a wrong number of rows (not just 1, as expected)
Created 04-05-2021 05:16 AM
can you collect STATS and re-run the query??
Created 04-07-2021 12:26 AM
Hi asish,
I collected statistics again (I already did it before posting), but the result is the same: 4 rows instead 1.
+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (13.45 seconds)
Created 04-05-2021 05:56 AM
Created on 04-07-2021 12:31 AM - edited 04-07-2021 12:47 AM
Hi @Magudeswaran ,
I can rewrite the query using sub-queries, but I have the same error:
select count(*) as num_all_changes
from (
select s.cod_pers, count(*) as num_changes
from (
select t.cod_pers, t.cod_address, count(*) as num_address
from be_prd_prt.test_case as t
group by t.cod_pers, t.cod_address
) as s
group by s.cod_pers
having count(*)>1
) as gg
;
+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (18.077 seconds)
As you can see, always 4 rows...