About jbarnett1

jbarnett1 · ‎06-25-2019

I have a cute little hql snippet that I have been using to create test data using explode and the row_number() windowing function. Come to find out, it is making duplicate rows! If I generate 1,000 rows then I get unique row numbers. But larger number of rows generate duplicates. Weird thing is it seems to produce 1,024 unique values no matter how many rows > 1000 that you generate. Just the bigger you get, the higher the count of these duplicate rows! Anybody see any patterns here?!? DROP TABLE IF EXISTS default.dummy_data; CREATE TABLE default.dummy_data AS SELECT row_nbr as key FROM ( SELECT row_number() OVER (partition by '1') as row_nbr --SELECT row_number() over() as row_nbr FROM ( --select explode(split(repeat("x,", 1000-1), ",")) -- 1,000 distinct --select explode(split(repeat("x,", 10000-1), ",")) -- 1,024 distinct -- select explode(split(repeat("x,", 100000-1), ",")) -- 1,024 distinct select explode(split(repeat("x,", 1000000-1), ",")) -- 1,024 distinct ) AS x ) AS y; Here is a sample of the duplicate row counts from 1M rows generated. select count(distinct key) as c from default.dummy_data; +-------+------+ | key | c | +-------+------+ | 1 | 977 | | 8 | 977 | | 10 | 977 | | 11 | 977 | | 12 | 977 | | 19 | 977 | | 21 | 977 | | 28 | 977 | | 29 | 977 | | 30 | 977 | | 33 | 977 | | 34 | 977 | ... +-------+------+ | key | c | +-------+------+ | 627 | 976 | | 628 | 976 | | 631 | 976 | | 634 | 976 | | 636 | 976 | | 637 | 976 | | 638 | 976 | | 639 | 976 | | 640 | 976 | | 641 | 976 | | 652 | 976 | | 653 | 976 | | 657 | 976 | | 659 | 976 | | 660 | 976 | | 661 | 976 | ... | 2 | 977 | | 3 | 977 | | 4 | 977 | | 5 | 977 | | 6 | 977 | | 7 | 977 | | 9 | 977 |

Online	Offline
Last Visited	‎06-25-2019 05:39 PM

Member Since	‎09-30-2017 06:06 AM
Last Visited	‎06-25-2019 05:39 PM
Posts	2

Cloudera Community

Hive row_number() function generating duplicate ro...