Created 06-20-2018 03:41 AM
screen-shot-2018-06-19-at-104245-pm.png
Count distinct doesn't always give me the right answer. I've attached two different queries that should both result in 7 unique items purchased. If I don't do an operation on mdse_item_i like cast it to a bigint, it doesn't always count them correctly.
to explain it simply, when i do cast on "mdse_itcountsdistinct-1.txtdoesnotcountdistinct.txtem_i" it gives unique results as 7, but when i don't do casting it gives unique results as 10 which is not correct.
hive> select * from dfr_distinct; OK 100000000938 5 7 12.33 2 2.75 4.27 8.060 2 8 0 Time taken: 0.479 seconds, Fetched: 1 row(s) hive> select * from dfr_distinctnot; OK 100000000938 5 10 12.33 2 2.75 4.27 8.06 0 2 8 0 Time taken: 0.932 seconds, Fetched: 1 row(s)
tried running the query in both MR and Tez modes still giving same results when i don't do casting.
Created 06-20-2018 01:36 PM
What datatype is mdse_item_i originally? Can you paste the output here for when you get the 7 distinct values versus the 10 distinct values? I'd like to see what the difference is.
Created 06-20-2018 05:56 PM
screen-shot-2018-06-20-at-125515-pm.png@anarasimham thanks for your reply.
mdse_item_i data type is int
`mdse_item_i` int
and the output attached as screenshot
Created 06-21-2018 11:34 AM
How many rows of data do you have?
Can you test this issue out with a new set of tables with a few rows in them? Can you try removing the rest of the columns to remove any extra variables?
Created 06-21-2018 03:42 PM
@Jalender Can you do distinct on mdse_item_i column alone with and without casting and share me the results? I think some leading spaces sometimes could lead to these issues.