Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to Mask the Hive columns data in Join operator, due to data criticality

avatar
Expert Contributor

Scenario as below

1) Have two tables A and B , having the column Account_Number.

2) If we did the join it should run fine and give result with Account_Number.

3) However this Account_number to be masked in out put ( its critical data ).

4) we have get out put, Account_number should not display all data, its should be like XXXXXX12345.

Solition 1: In Ranger we can disable the Account_number colum name, if we did it join objeration will fail.

Join has to run on Account_number only and this data need to masked ?

Is it possible to do.

Your concerns are appriciated.

1 ACCEPTED SOLUTION

avatar
Contributor

@zkfs, mask_show_last_n udf can be used for the scenario you mentioned.

You can run this query from the beeline to get additional details about udf:

DESCRIBE FUNCTION extended mask_show_last_n;

+---------------------------------------------------------------------------------------------------------------------------------+--+

|                                                            tab_name                                                             |

+---------------------------------------------------------------------------------------------------------------------------------+--+

| masks all but last n characters of the value                                                                                    |

| Examples:                                                                                                                       |

|    mask_show_last_n(ccn, 😎                                                                                                     |

|    mask_show_last_n(ccn, 8, 'x', 'x', 'x')                                                                                      |

|  Arguments:                                                                                                                     |

|    mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar)                                   |

|      value      - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR                         |

|      charCount  - number of characters. Default value: 4                                                                        |

|      upperChar  - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X'  |

|      lowerChar  - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x'  |

|      digitChar  - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n'       |

|      otherChar  - character to replace all other characters with. Specify -1 to retain original character. Default value: -1    |

|      numberChar - character to replace digits in a number with. Valid values: 0-9. Default value: '1'                           |

| NULL                                                                                                                            |

+---------------------------------------------------------------------------------------------------------------------------------+--+

Following query returns string with all the characters masked except last 4:

select mask_show_last_n('AAbb1234567', 4, 'X', 'X', 'X');
+--------------+--+

|     _c0      |

+--------------+--+

| XXXXXXX4567  |

+--------------+--+

1 row selected (0.085 seconds

View solution in original post

3 REPLIES 3

avatar
Contributor

You can use data masking udfs instead of disabling the column.

You can find more details from here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFun...

https://issues.apache.org/jira/browse/HIVE-13568

avatar
Expert Contributor

thanks for your reply.

Have you implemented the UDF for this Masking the columnn, if yes please share sample code..

Thank you

avatar
Contributor

@zkfs, mask_show_last_n udf can be used for the scenario you mentioned.

You can run this query from the beeline to get additional details about udf:

DESCRIBE FUNCTION extended mask_show_last_n;

+---------------------------------------------------------------------------------------------------------------------------------+--+

|                                                            tab_name                                                             |

+---------------------------------------------------------------------------------------------------------------------------------+--+

| masks all but last n characters of the value                                                                                    |

| Examples:                                                                                                                       |

|    mask_show_last_n(ccn, 😎                                                                                                     |

|    mask_show_last_n(ccn, 8, 'x', 'x', 'x')                                                                                      |

|  Arguments:                                                                                                                     |

|    mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar)                                   |

|      value      - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR                         |

|      charCount  - number of characters. Default value: 4                                                                        |

|      upperChar  - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X'  |

|      lowerChar  - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x'  |

|      digitChar  - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n'       |

|      otherChar  - character to replace all other characters with. Specify -1 to retain original character. Default value: -1    |

|      numberChar - character to replace digits in a number with. Valid values: 0-9. Default value: '1'                           |

| NULL                                                                                                                            |

+---------------------------------------------------------------------------------------------------------------------------------+--+

Following query returns string with all the characters masked except last 4:

select mask_show_last_n('AAbb1234567', 4, 'X', 'X', 'X');
+--------------+--+

|     _c0      |

+--------------+--+

| XXXXXXX4567  |

+--------------+--+

1 row selected (0.085 seconds