Support Questions

Find answers, ask questions, and share your expertise

Count Distinct discrepancy --Hive

avatar
Contributor

screen-shot-2018-06-19-at-104245-pm.png

Count distinct doesn't always give me the right answer. I've attached two different queries that should both result in 7 unique items purchased. If I don't do an operation on mdse_item_i like cast it to a bigint, it doesn't always count them correctly.

to explain it simply, when i do cast on "mdse_itcountsdistinct-1.txtdoesnotcountdistinct.txtem_i" it gives unique results as 7, but when i don't do casting it gives unique results as 10 which is not correct.

hive> select * from dfr_distinct; OK 100000000938 5 7 12.33 2 2.75 4.27 8.060 2 8 0 Time taken: 0.479 seconds, Fetched: 1 row(s) hive> select * from dfr_distinctnot; OK 100000000938 5 10 12.33 2 2.75 4.27 8.06 0 2 8 0 Time taken: 0.932 seconds, Fetched: 1 row(s)

tried running the query in both MR and Tez modes still giving same results when i don't do casting.

4 REPLIES 4

avatar
Super Collaborator

What datatype is mdse_item_i originally? Can you paste the output here for when you get the 7 distinct values versus the 10 distinct values? I'd like to see what the difference is.

avatar
Contributor

screen-shot-2018-06-20-at-125515-pm.png@anarasimham thanks for your reply.

mdse_item_i data type is int

`mdse_item_i` int

and the output attached as screenshot

avatar
Super Collaborator

How many rows of data do you have?

Can you test this issue out with a new set of tables with a few rows in them? Can you try removing the rest of the columns to remove any extra variables?

avatar
New Contributor

@Jalender Can you do distinct on mdse_item_i column alone with and without casting and share me the results? I think some leading spaces sometimes could lead to these issues.