
Hive - joins



I have two tables, with 4 million and 200 million records respectively, both in ORC format.

Assuming the smaller table does not fit in memory, is an SMB (sort-merge-bucket) join the best way to go if I have to join and match these tables?

If new partitions are added to the bigger table every day, what optimizations can we apply?

Thanks in advance.



Apart from SMB, I would try bucketing. Both tables should be bucketed, and the bucketing columns should be chosen so that they are the same columns used in the join. Data skew should also be considered: if the join keys are skewed, the Hive tables have to be created as skewed tables. Also check that the files created in each partition directory are at least as large as the HDFS block size. If they are not, you may either have to change the partitioning scheme or use CONCATENATE to merge the small files in Hive.
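
To make the points above concrete, here is a rough HiveQL sketch. All table names, column names, the bucket count, the skewed key values, and the partition value are made up for illustration. The hive.enforce.* settings only matter on Hive versions before 2.0, as far as I know. Treat this as a starting point to adapt, not a tuned configuration.

```sql
-- Hypothetical schema: big_tbl (~200M rows) and small_tbl (~4M rows), joined on join_key.
-- Both tables are bucketed AND sorted on the join column, with the same bucket count,
-- which is what a sort-merge-bucket join needs.
CREATE TABLE big_tbl (
  join_key BIGINT,
  payload  STRING
)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (join_key) SORTED BY (join_key ASC) INTO 64 BUCKETS  -- 64 is an assumption
STORED AS ORC;

CREATE TABLE small_tbl (
  join_key BIGINT,
  attr     STRING
)
CLUSTERED BY (join_key) SORTED BY (join_key ASC) INTO 64 BUCKETS
STORED AS ORC;

-- On Hive versions before 2.0, make inserts honor the bucketing and sorting spec.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Let the optimizer convert eligible joins to bucket map / SMB joins.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;

SELECT b.join_key, b.payload, s.attr
FROM big_tbl b
JOIN small_tbl s
  ON b.join_key = s.join_key;

-- Skew: declare the heavy keys (0 and -1 are placeholder values) and enable skew join handling.
CREATE TABLE big_tbl_skewed (
  join_key BIGINT,
  payload  STRING
)
SKEWED BY (join_key) ON (0, -1)
STORED AS ORC;

SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;  -- row-count threshold per key; 100000 is the default

-- Small files: merge the ORC files of one partition of a hypothetical non-bucketed table.
-- As far as I recall, Hive rejects CONCATENATE on bucketed tables, hence the separate table.
ALTER TABLE events_raw PARTITION (load_date='2020-01-01') CONCATENATE;
```

Running EXPLAIN on the join is the quickest way to check whether the plan actually uses a sort-merge-bucket join. For the daily partitions, each new partition has to be loaded with the same bucketing and sorting, otherwise the join falls back to a regular shuffle join.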

Check these links

The links above cover general join optimization in Hive. Hope it helps!
