Support Questions

giovannimori · ‎03-31-2021

Hi,

on Hive 3.1.4 we have a COUNT(*) query which returns more than one rows, instead of exactly one.

I created this test table:

CREATE TABLE test_case (
cod_pers STRING,
cod_address STRING,
PRIMARY KEY (cod_pers) DISABLE NOVALIDATE )
PARTITIONED BY (num_snapshot BIGINT) ;

and populated it with more than 500,000 rows in 3 partitions.

To find how many people changed address, I wrote this query:

WITH data1 AS ( -- eliminate duplicates
select t.cod_pers, t.cod_address, count(*) as num_address
from test_case as t
group by t.cod_pers, t.cod_address
), data2 AS ( -- find changes per person
select s.cod_pers, count(*) as num_changes
from data1 s
group by s.cod_pers
having count(*)>1 )
select count(*) as num_all_changes from data2 as gg ;

Instead of obtaining a single row with the total number, the query returns 4 rows :

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (1.252 seconds)

4 is also the number of reducers used by the query.

If I add a "LIMIT 1" clause, the query works as expected, and a final reducer is added in the query execution. The same happens adding "GROUP BY 1" clause.

I tried to COMPACT the table and have the statistics recalculated, but nothing changed.

I know I can rewrite the query, but I'm not looking for a work-aroud: I wonder if it's a known bug of CBO and if someone else experimented the same behavior (and also if a patch is available).

Thanks in advance

asish · ‎03-31-2021

Can you please try the below command:

select count(*) from data2;

It should return one single row .

giovannimori · ‎03-31-2021

Hi,

the simple count(*) on the base table is working correctly:

select count(*) from be_prd_prt.test_case;
+---------+
| _c0 |
+---------+
| 537586 |
+---------+
1 row selected (0.2 seconds)

I have troubles with the query I wrote, because it returns a wrong number of rows (not just 1, as expected)

asish · ‎04-05-2021

can you collect STATS and re-run the query??

giovannimori · ‎04-07-2021

Hi asish,

I collected statistics again (I already did it before posting), but the result is the same: 4 rows instead 1.

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (13.45 seconds)

Magudeswaran · ‎04-05-2021

Hi @giovannimori

can you please try to union the two subqueries and form a single view?

giovannimori · ‎04-07-2021

Hi @Magudeswaran ,

I can rewrite the query using sub-queries, but I have the same error:

select count(*) as num_all_changes
from (
select s.cod_pers, count(*) as num_changes
from (
select t.cod_pers, t.cod_address, count(*) as num_address
from be_prd_prt.test_case as t
group by t.cod_pers, t.cod_address
) as s
group by s.cod_pers
having count(*)>1
) as gg
;

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (18.077 seconds)

As you can see, always 4 rows...

Cloudera Community

Support Questions

Hive 3.1.4 Multiple rows from COUNT(*) query