Support Questions

Find answers, ask questions, and share your expertise

Hive row_number() function generating duplicate row numbers

avatar
New Contributor

I have a cute little hql snippet that I have been using to create test data using explode and the row_number() windowing function. Come to find out, it is making duplicate rows! If I generate 1,000 rows then I get unique row numbers. But larger number of rows generate duplicates. Weird thing is it seems to produce 1,024 unique values no matter how many rows > 1000 that you generate. Just the bigger you get, the higher the count of these duplicate rows!

Anybody see any patterns here?!?


DROP TABLE IF EXISTS default.dummy_data;
CREATE TABLE default.dummy_data 
AS
SELECT row_nbr as key
FROM (
SELECT row_number() OVER (partition by '1') as row_nbr
--SELECT row_number() over() as row_nbr
FROM (
--select explode(split(repeat("x,", 1000-1), ",")) -- 1,000 distinct
--select explode(split(repeat("x,", 10000-1), ",")) -- 1,024 distinct
-- select explode(split(repeat("x,", 100000-1), ",")) -- 1,024 distinct
select explode(split(repeat("x,", 1000000-1), ",")) -- 1,024 distinct
) AS x
) AS y;

Here is a sample of the duplicate row counts from 1M rows generated.

select count(distinct key) as c from default.dummy_data;

+-------+------+
|  key  |  c   |
+-------+------+
| 1     | 977  |
| 8     | 977  |
| 10    | 977  |
| 11    | 977  |
| 12    | 977  |
| 19    | 977  |
| 21    | 977  |
| 28    | 977  |
| 29    | 977  |
| 30    | 977  |
| 33    | 977  |
| 34    | 977  |
...
+-------+------+
|  key  |  c   |
+-------+------+
| 627   | 976  |
| 628   | 976  |
| 631   | 976  |
| 634   | 976  |
| 636   | 976  |
| 637   | 976  |
| 638   | 976  |
| 639   | 976  |
| 640   | 976  |
| 641   | 976  |
| 652   | 976  |
| 653   | 976  |
| 657   | 976  |
| 659   | 976  |
| 660   | 976  |
| 661   | 976  |

...
| 2     | 977  |
| 3     | 977  |
| 4     | 977  |
| 5     | 977  |
| 6     | 977  |
| 7     | 977  |
| 9     | 977  |



3 REPLIES 3

avatar
Master Guru

@Jim Barnett

Not able to recreate the same scenario on my end(using hive 1.2.1).

Wondering how count(distinct key) is generated just one row of count as result ,but in your case query is giving results as

select key,count(*) from default.dummy_data group by key;

-----

Check is there any data that already exists in hdfs directory that table is pointing to,

If yes clear off the data from the hdfs directory and recreate the table again.

Example:

hive> CREATE TABLE default.dummy_data
    > AS
    > SELECT row_nbr as key
    > FROM (
    > SELECT row_number() OVER (partition by '1') as row_nbr
    > FROM (
    > select explode(split(repeat("x,", 1000000-1), ",")) -- 1,024 distinct
    > ) AS x
    > ) AS y;
hive> select count(distinct key) c from default.dummy_data;
1000000
hive> select key,count(*)cnt from default.dummy_data group by key order by cnt desc limit 10; --ordering desc and limiting 10

Result:

key     cnt
10      1
9       1
1000000 1
7       1
6       1
5       1
4       1
3       1
999999  1
1       1




avatar
Expert Contributor

I got the same problem. I have a table with over 3k records. I ran the following query in Hive and see the Key1 column recycling the number from 1 to 1024.

select row_number() over () as Key1 from mytable;


avatar
Expert Contributor

The problem went away after I changed my script adding "order by field1".

select row_number() over (order by Field1) as Key1 from mytable;