Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive 3.1.4 Multiple rows from COUNT(*) query

avatar

Hi,

on Hive 3.1.4 we have a COUNT(*) query which returns more than one rows, instead of exactly one.

I created this test table:

CREATE TABLE test_case (
cod_pers STRING,
cod_address STRING,
PRIMARY KEY (cod_pers) DISABLE NOVALIDATE )
PARTITIONED BY (num_snapshot BIGINT) ;

and populated it with more than 500,000 rows in 3 partitions.

To find how many people changed address, I wrote this query:

WITH data1 AS ( -- eliminate duplicates
select t.cod_pers, t.cod_address, count(*) as num_address
from test_case as t
group by t.cod_pers, t.cod_address
), data2 AS ( -- find changes per person
select s.cod_pers, count(*) as num_changes
from data1 s
group by s.cod_pers
having count(*)>1 )
select count(*) as num_all_changes from data2 as gg ;

Instead of obtaining a single row with the total number, the query returns 4 rows :

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (1.252 seconds)

4 is also the number of reducers used by the query.

If I add a "LIMIT 1" clause, the query works as expected, and a final reducer is added in the query execution. The same happens adding "GROUP BY 1" clause.

 

I tried to COMPACT the table and have the statistics recalculated, but nothing changed.

I know I can rewrite the query, but I'm not looking for a work-aroud: I wonder if it's a known bug of CBO and if someone else experimented the same behavior (and also if a patch is available).

 

Thanks in advance

6 REPLIES 6

avatar
Guru

Can you please try the below command:

   select count(*) from  data2; 

 

It should return one single row .

avatar

Hi,

the simple count(*) on the base table is working correctly:

select count(*) from be_prd_prt.test_case;
+---------+
| _c0 |
+---------+
| 537586 |
+---------+
1 row selected (0.2 seconds)

I have troubles with the query I wrote, because it returns a wrong number of rows (not just 1, as expected)

avatar
Guru

can you collect STATS and re-run the query??

avatar

Hi asish,

I collected statistics again (I already did it before posting), but the result is the same: 4 rows instead 1.

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (13.45 seconds)

avatar
Contributor

Hi @giovannimori 

 

can you please try to union the two subqueries and form a single view?

avatar

Hi @Magudeswaran ,

I can rewrite the query using sub-queries, but I have the same error:

select count(*) as num_all_changes
from (
select s.cod_pers, count(*) as num_changes
from (
select t.cod_pers, t.cod_address, count(*) as num_address
from be_prd_prt.test_case as t
group by t.cod_pers, t.cod_address
) as s
group by s.cod_pers
having count(*)>1
) as gg
;

+------------------+
| num_all_changes |
+------------------+
| 63 |
| 58 |
| 64 |
| 59 |
+------------------+
4 rows selected (18.077 seconds)

 

As you can see, always 4 rows...