Created 06-25-2019 04:11 AM
I have a cute little HQL snippet that I have been using to create test data with explode and the row_number() windowing function. Come to find out, it is producing duplicate rows! If I generate 1,000 rows I get unique row numbers, but larger row counts generate duplicates. The weird thing is that it seems to produce exactly 1,024 unique values no matter how many rows above 1,000 you generate — the bigger the row count, the higher the count of these duplicate rows!
Anybody see any patterns here?!?
DROP TABLE IF EXISTS default.dummy_data;

CREATE TABLE default.dummy_data AS
SELECT row_nbr as key
FROM (
  SELECT row_number() OVER (partition by '1') as row_nbr
  --SELECT row_number() over() as row_nbr
  FROM (
    --select explode(split(repeat("x,", 1000-1), ","))    -- 1,000 distinct
    --select explode(split(repeat("x,", 10000-1), ","))   -- 1,024 distinct
    --select explode(split(repeat("x,", 100000-1), ","))  -- 1,024 distinct
    select explode(split(repeat("x,", 1000000-1), ","))   -- 1,024 distinct
  ) AS x
) AS y;
Here is a sample of the duplicate row counts from 1M rows generated.
select count(distinct key) as c from default.dummy_data;

+-------+------+
|  key  |  c   |
+-------+------+
| 1     | 977  |
| 8     | 977  |
| 10    | 977  |
| 11    | 977  |
| 12    | 977  |
| 19    | 977  |
| 21    | 977  |
| 28    | 977  |
| 29    | 977  |
| 30    | 977  |
| 33    | 977  |
| 34    | 977  |
...
| 627   | 976  |
| 628   | 976  |
| 631   | 976  |
| 634   | 976  |
| 636   | 976  |
| 637   | 976  |
| 638   | 976  |
| 639   | 976  |
| 640   | 976  |
| 641   | 976  |
| 652   | 976  |
| 653   | 976  |
| 657   | 976  |
| 659   | 976  |
| 660   | 976  |
| 661   | 976  |
...
| 2     | 977  |
| 3     | 977  |
| 4     | 977  |
| 5     | 977  |
| 6     | 977  |
| 7     | 977  |
| 9     | 977  |
+-------+------+
Created 06-25-2019 06:02 PM
Not able to recreate the same scenario on my end (using Hive 1.2.1).
I'm also wondering how count(distinct key) produced that output — it should return just a single count row, but in your case the results look like the output of this query instead:
select key,count(*) from default.dummy_data group by key;
-----
Check whether there is any data already in the HDFS directory that the table points to.
If yes, clear the data from the HDFS directory and recreate the table.
Example:
hive> CREATE TABLE default.dummy_data
    > AS
    > SELECT row_nbr as key
    > FROM (
    >   SELECT row_number() OVER (partition by '1') as row_nbr
    >   FROM (
    >     select explode(split(repeat("x,", 1000000-1), ",")) -- 1,024 distinct
    >   ) AS x
    > ) AS y;
hive> select count(distinct key) c from default.dummy_data;
1000000
hive> select key, count(*) cnt from default.dummy_data group by key order by cnt desc limit 10; --ordering desc and limiting 10
Result:
key      cnt
10       1
9        1
1000000  1
7        1
6        1
5        1
4        1
3        1
999999   1
1        1
Created 08-01-2019 12:21 AM
I got the same problem. I have a table with over 3k records. I ran the following query in Hive and saw the Key1 column recycling the numbers from 1 to 1024.
select row_number() over () as Key1 from mytable;
Created 08-01-2019 06:44 PM
The problem went away after I changed my script to add "order by Field1".
select row_number() over (order by Field1) as Key1 from mytable;
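Applying that fix back to the original data generator might look like the sketch below. This is an assumption based on the replies above, not a verified patch: adding ORDER BY inside the OVER clause forces a total ordering, so row numbers are assigned once across all rows rather than restarting per internal batch (1,024 happens to match Hive's default vectorized row-batch size, which may explain the recycling). The column alias `col` is introduced here for illustration; the exploded column is unnamed in the original.

```sql
-- Sketch: original generator with the "order by" fix from the last reply.
-- Ordering on an all-equal column still yields unique row numbers, since
-- row_number() breaks ties arbitrarily but never repeats a value.
DROP TABLE IF EXISTS default.dummy_data;

CREATE TABLE default.dummy_data AS
SELECT row_nbr AS key
FROM (
  SELECT row_number() OVER (ORDER BY col) AS row_nbr
  FROM (
    SELECT explode(split(repeat("x,", 1000000-1), ",")) AS col
  ) AS x
) AS y;

-- Alternative sketch that avoids the window function entirely:
-- posexplode() emits each element's position alongside its value,
-- so pos + 1 gives a guaranteed-unique 1..N key.
SELECT pos + 1 AS key
FROM (
  SELECT posexplode(split(repeat("x,", 1000000-1), ","))
) AS x;
```

If the keys only need to be unique (not globally sequential), the posexplode variant also sidesteps the single-reducer sort that a global ORDER BY implies.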