Member since
03-04-2021
5
Posts
2
Kudos Received
0
Solutions
04-07-2021
12:31 AM
Hi @Magudeswaran , I can rewrite the query using sub-queries, but I have the same error: select count(*) as num_all_changes from ( select s.cod_pers, count(*) as num_changes from ( select t.cod_pers, t.cod_address, count(*) as num_address from be_prd_prt.test_case as t group by t.cod_pers, t.cod_address ) as s group by s.cod_pers having count(*)>1 ) as gg ; +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (18.077 seconds) As you can see, always 4 rows...
... View more
04-07-2021
12:26 AM
Hi asish, I collected statistics again (I already did it before posting), but the result is the same: 4 rows instead 1. +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (13.45 seconds)
... View more
03-31-2021
07:59 AM
1 Kudo
The "msck repair table ..." command does not really read new data files, but adds new partitions (subdirectories in HDFS) in table metadata. What you could do is to create in advance all the partitions (for month or more) - initially empty- and run the "repair" command just once: hdfs dfs -mkdir /user/data/year=2021/month=04/day=1 ... hdfs dfs -mkdir /user/data/year=2021/month=04/day=30 hive>msck repair table <table_name> When You put your log files inside one of these directories, they will be immediately visible from Hive (just set the correct permissions using Ranger or hdfs). Maybe You can repeat this operations (create directories and "repair table") during logs maintenance, as you should have some policies to remove old logs Hope this helps
... View more
03-31-2021
02:18 AM
1 Kudo
Hi, the simple count(*) on the base table is working correctly: select count(*) from be_prd_prt.test_case; +---------+ | _c0 | +---------+ | 537586 | +---------+ 1 row selected (0.2 seconds) I have troubles with the query I wrote, because it returns a wrong number of rows (not just 1, as expected)
... View more
03-31-2021
01:17 AM
Hi, on Hive 3.1.4 we have a COUNT(*) query which returns more than one rows, instead of exactly one. I created this test table: CREATE TABLE test_case ( cod_pers STRING, cod_address STRING, PRIMARY KEY (cod_pers) DISABLE NOVALIDATE ) PARTITIONED BY (num_snapshot BIGINT) ; and populated it with more than 500,000 rows in 3 partitions. To find how many people changed address, I wrote this query: WITH data1 AS ( -- eliminate duplicates select t.cod_pers, t.cod_address, count(*) as num_address from test_case as t group by t.cod_pers, t.cod_address ), data2 AS ( -- find changes per person select s.cod_pers, count(*) as num_changes from data1 s group by s.cod_pers having count(*)>1 ) select count(*) as num_all_changes from data2 as gg ; Instead of obtaining a single row with the total number, the query returns 4 rows : +------------------+ | num_all_changes | +------------------+ | 63 | | 58 | | 64 | | 59 | +------------------+ 4 rows selected (1.252 seconds) 4 is also the number of reducers used by the query. If I add a "LIMIT 1" clause, the query works as expected, and a final reducer is added in the query execution. The same happens adding "GROUP BY 1" clause. I tried to COMPACT the table and have the statistics recalculated, but nothing changed. I know I can rewrite the query, but I'm not looking for a work-aroud: I wonder if it's a known bug of CBO and if someone else experimented the same behavior (and also if a patch is available). Thanks in advance
... View more
Labels: