I have a big Hive table. To identify a unique record I need to check 4 columns, let's say col_A, col_B, col_C and col_D. I use these columns to identify changed records and overwrite Hive table partitions.
I am trying several options to do this:
1) Join on all 4 columns separately.
2) Concatenate all 4 fields to create a key column (string type).
3) Use step 2 and convert the string to binary (is binary faster to join on?).
4) Use functions like HEX, MD5 (string), SHA1/SHA2 or CRC32 to form a key column and join on that single key.
Have you used anything similar in your projects? What gave better performance? Is there any other, better way to join? Or has Hive improved to the point that it no longer matters?
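For concreteness, here is a sketch of the two extremes being compared, using placeholder table names `target` and `source` (not from any real schema):

```sql
-- Option 1: join directly on the four natural-key columns
SELECT s.*
FROM   target t
JOIN   source s
  ON   s.col_A = t.col_A
 AND   s.col_B = t.col_B
 AND   s.col_C = t.col_C
 AND   s.col_D = t.col_D;

-- Option 4: join on a single derived key
-- (concat_ws with a separator avoids accidental key clashes
--  such as 'ab'+'c' vs 'a'+'bc')
SELECT s.*
FROM   target t
JOIN   source s
  ON   md5(concat_ws('|', s.col_A, s.col_B, s.col_C, s.col_D))
     = md5(concat_ws('|', t.col_A, t.col_B, t.col_C, t.col_D));
```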
Trust me, don't go with creating a hash code using MD5 or hash(). I faced a similar problem where I had to join a table holding 40 billion records with another table. The join column was created by applying a hash function to the combined fields. It was a nightmare for us: the job ran for 7 hours and occupied the entire cluster, so no other jobs could be triggered. Based on that experience, it is always better to go with 4 join conditions rather than concatenating the fields, or generating a hash and joining based on that.
I don't think there is a straightforward answer to your question.
The choice of approach will heavily depend on the data in those four columns and on the way you store your data. If you use ORC and Tez, there are some articles about optimizations that can be done at the data and metadata level.
To your approaches:
1) Join on all 4 columns: pick the column with the fewest repeating values and insert the data sorted (or even ordered, if that is not too heavy an operation for your cluster), run "analyze table", and enable the CBO for the query.
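A minimal sketch of the statistics and CBO part, assuming a table named `target` (these are standard Hive statements and session properties, but check the defaults for your Hive version):

```sql
-- Gather table-level and column-level statistics
-- so the cost-based optimizer can plan the join well
ANALYZE TABLE target COMPUTE STATISTICS;
ANALYZE TABLE target COMPUTE STATISTICS FOR COLUMNS;

-- Enable the cost-based optimizer and stats usage for the session
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
```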
2) Same as #1. A concatenated field will be unique, which is a perfect case for tuning stripe and stride size in order to get the best performance on lookup / join.
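As an illustration of that tuning, here is a hypothetical ORC table definition with the relevant properties set explicitly; the column names follow the question, `row_key` is an assumed name for the concatenated key, and the numeric values are examples to adjust, not recommendations:

```sql
CREATE TABLE target (
  col_A   STRING,
  col_B   STRING,
  col_C   STRING,
  col_D   STRING,
  row_key STRING   -- concatenation of the four key columns
)
STORED AS ORC
TBLPROPERTIES (
  'orc.stripe.size'          = '67108864',  -- 64 MB stripes
  'orc.row.index.stride'     = '10000',     -- rows per row-index entry
  'orc.bloom.filter.columns' = 'row_key'    -- bloom filter for key lookups
);
```

Inserting the data sorted by `row_key` keeps each stripe's min/max index tight, so lookups and joins can skip most stripes.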
3) What is "binary"? If you mean any kind of hashing, see the next point. Otherwise, if the number of bytes in the binary representation stays the same as in the string, it just doesn't matter; in my opinion, keep strings, as they are easier to debug and audit.
4) Any kind of hashing, encoding or encryption can have collisions, so you will need to check the original fields anyway. And I don't think you gain any value from having shorter values in that column.
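In practice, the collision check means a hash key can only narrow the join, never replace the natural-key comparison. A sketch, assuming a precomputed `row_hash` column on both (hypothetical) tables:

```sql
-- Even when joining on a hashed key, re-check the natural key columns,
-- because distinct rows can hash to the same value
SELECT s.*
FROM   source s
JOIN   target t
  ON   s.row_hash = t.row_hash   -- e.g. md5(concat_ws('|', ...))
 AND   s.col_A = t.col_A         -- guards against hash collisions
 AND   s.col_B = t.col_B
 AND   s.col_C = t.col_C
 AND   s.col_D = t.col_D;
```

At that point you are paying for the hash on top of the 4-column comparison, which is the argument against #3 and #4.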
So I would go with either #1 or #2; again, it depends on your data. At the end of that article I provided more links; check them and make your choice.
Good luck in finding your way 🙂