About giovannimori

giovannimori · ‎04-07-2021

Hi @Magudeswaran , I can rewrite the query using sub-queries, but I have the same error: select count(*) as num_all_changes from ( select s.cod_pers, count(*) as num_changes from ( select t.cod_pers, t.cod_address, count(*) as num_address from be_prd_prt.test_case as t group by t.cod_pers, t.cod_address ) as s group by s.cod_pers having count(*)>1 ) as gg ; +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (18.077 seconds) As you can see, always 4 rows...

giovannimori · ‎04-07-2021

Hi asish, I collected statistics again (I already did it before posting), but the result is the same: 4 rows instead 1. +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (13.45 seconds)

giovannimori · ‎03-31-2021

The "msck repair table ..." command does not really read new data files, but adds new partitions (subdirectories in HDFS) in table metadata. What you could do is to create in advance all the partitions (for month or more) - initially empty- and run the "repair" command just once: hdfs dfs -mkdir /user/data/year=2021/month=04/day=1 ... hdfs dfs -mkdir /user/data/year=2021/month=04/day=30 hive>msck repair table <table_name> When You put your log files inside one of these directories, they will be immediately visible from Hive (just set the correct permissions using Ranger or hdfs). Maybe You can repeat this operations (create directories and "repair table") during logs maintenance, as you should have some policies to remove old logs Hope this helps

giovannimori · ‎03-31-2021

Hi, the simple count(*) on the base table is working correctly: select count(*) from be_prd_prt.test_case; +---------+ | _c0 | +---------+ | 537586 | +---------+ 1 row selected (0.2 seconds) I have troubles with the query I wrote, because it returns a wrong number of rows (not just 1, as expected)

giovannimori · ‎03-31-2021

Hi, on Hive 3.1.4 we have a COUNT(*) query which returns more than one rows, instead of exactly one. I created this test table: CREATE TABLE test_case ( cod_pers STRING, cod_address STRING, PRIMARY KEY (cod_pers) DISABLE NOVALIDATE ) PARTITIONED BY (num_snapshot BIGINT) ; and populated it with more than 500,000 rows in 3 partitions. To find how many people changed address, I wrote this query: WITH data1 AS ( -- eliminate duplicates select t.cod_pers, t.cod_address, count(*) as num_address from test_case as t group by t.cod_pers, t.cod_address ), data2 AS ( -- find changes per person select s.cod_pers, count(*) as num_changes from data1 s group by s.cod_pers having count(*)>1 ) select count(*) as num_all_changes from data2 as gg ; Instead of obtaining a single row with the total number, the query returns 4 rows : +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (1.252 seconds) 4 is also the number of reducers used by the query. If I add a "LIMIT 1" clause, the query works as expected, and a final reducer is added in the query execution. The same happens adding "GROUP BY 1" clause. I tried to COMPACT the table and have the statistics recalculated, but nothing changed. I know I can rewrite the query, but I'm not looking for a work-aroud: I wonder if it's a known bug of CBO and if someone else experimented the same behavior (and also if a patch is available). Thanks in advance

Online	Offline
Last Visited	‎04-08-2021 12:59 AM

Member Since	‎03-04-2021 02:51 AM
Last Visited	‎04-08-2021 12:59 AM
Posts	5
Kudos received	2

Cloudera Community

Re: Hive 3.1.4 Multiple rows from COUNT(*) query

Re: Hive 3.1.4 Multiple rows from COUNT(*) query

Re: Hive external table partition update automatic...

Re: Hive 3.1.4 Multiple rows from COUNT(*) query

Hive 3.1.4 Multiple rows from COUNT(*) query