Member since
09-30-2017
2
Posts
0
Kudos Received
0
Solutions
06-25-2019
04:11 AM
I have a cute little hql snippet that I have been using to create test data using explode and the row_number() windowing function. Come to find out, it is making duplicate rows! If I generate 1,000 rows then I get unique row numbers. But larger number of rows generate duplicates. Weird thing is it seems to produce 1,024 unique values no matter how many rows > 1000 that you generate. Just the bigger you get, the higher the count of these duplicate rows! Anybody see any patterns here?!? DROP TABLE IF EXISTS default.dummy_data;
CREATE TABLE default.dummy_data
AS
SELECT row_nbr as key
FROM (
SELECT row_number() OVER (partition by '1') as row_nbr
--SELECT row_number() over() as row_nbr
FROM (
--select explode(split(repeat("x,", 1000-1), ",")) -- 1,000 distinct
--select explode(split(repeat("x,", 10000-1), ",")) -- 1,024 distinct
-- select explode(split(repeat("x,", 100000-1), ",")) -- 1,024 distinct
select explode(split(repeat("x,", 1000000-1), ",")) -- 1,024 distinct
) AS x
) AS y; Here is a sample of the duplicate row counts from 1M rows generated. select count(distinct key) as c from default.dummy_data;
+-------+------+
| key | c |
+-------+------+
| 1 | 977 |
| 8 | 977 |
| 10 | 977 |
| 11 | 977 |
| 12 | 977 |
| 19 | 977 |
| 21 | 977 |
| 28 | 977 |
| 29 | 977 |
| 30 | 977 |
| 33 | 977 |
| 34 | 977 |
...
+-------+------+
| key | c |
+-------+------+
| 627 | 976 |
| 628 | 976 |
| 631 | 976 |
| 634 | 976 |
| 636 | 976 |
| 637 | 976 |
| 638 | 976 |
| 639 | 976 |
| 640 | 976 |
| 641 | 976 |
| 652 | 976 |
| 653 | 976 |
| 657 | 976 |
| 659 | 976 |
| 660 | 976 |
| 661 | 976 |
...
| 2 | 977 |
| 3 | 977 |
| 4 | 977 |
| 5 | 977 |
| 6 | 977 |
| 7 | 977 |
| 9 | 977 |
... View more
Labels:
- Labels:
-
Apache Hive