Count distinct doesn't always give me the right answer. I've attached two different queries that should both result in 7 unique items purchased. If I don't do an operation on mdse_item_i like cast it to a bigint, it doesn't always count them correctly.
to explain it simply, when i do cast on "mdse_itcountsdistinct-1.txtdoesnotcountdistinct.txtem_i" it gives unique results as 7, but when i don't do casting it gives unique results as 10 which is not correct.
hive> select * from dfr_distinct; OK 100000000938 5 7 12.33 2 2.75 4.27 8.060 2 8 0 Time taken: 0.479 seconds, Fetched: 1 row(s) hive> select * from dfr_distinctnot; OK 100000000938 5 10 12.33 2 2.75 4.27 8.06 0 2 8 0 Time taken: 0.932 seconds, Fetched: 1 row(s)
tried running the query in both MR and Tez modes still giving same results when i don't do casting.
What datatype is mdse_item_i originally? Can you paste the output here for when you get the 7 distinct values versus the 10 distinct values? I'd like to see what the difference is.
How many rows of data do you have?
Can you test this issue out with a new set of tables with a few rows in them? Can you try removing the rest of the columns to remove any extra variables?