Which one is better to use in Hive: DISTINCT or GROUP BY? And how is each of them processed in the background?
Check the explain plan of both. I believe the distinct is re-written to a group-by by the planner.
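A minimal sketch of that comparison, assuming a hypothetical table `orders` with a column `customer_id` (neither name comes from this thread). Run both statements and compare the operator trees that Hive prints; if the planner does rewrite DISTINCT, both plans should show a Group By Operator.

```sql
-- Hypothetical table and column names, used only for illustration.

-- Plan for the DISTINCT form
EXPLAIN
SELECT DISTINCT customer_id
FROM orders;

-- Plan for the equivalent GROUP BY form
EXPLAIN
SELECT customer_id
FROM orders
GROUP BY customer_id;
```

The exact plan text depends on the Hive version and execution engine, so compare the two outputs on your own cluster rather than relying on a fixed expected result.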
@Ravi teja In my experience, GROUP BY is faster than DISTINCT. GROUP BY essentially segregates rows by key, and that kind of key/value processing is something MapReduce handles with ease. I would say it is better to go with GROUP BY.
Gunther is right: the Hive planner rewrites DISTINCT as a GROUP BY, so it doesn't matter which one you use from a performance point of view.