Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

masking UFD function for impala

avatar
New Contributor

hi 

 

Need an effieciant impala function to mask/hide some charectors. java/hive is working fine with below.

for (int counter = 4; counter < len - 5; counter++)
if (input[counter] != ' ')
input[counter] = '*';
} else {
for (int counter = 0; counter < len; counter++) {
if (input[counter] != ' ')
input[counter] = '*';
}
}

so i need help with c++/impala

 

Regards

 

3 REPLIES 3

avatar

 

[localhost:21000] > select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5;
Query: select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5
Query submitted at: 2017-09-25 07:38:09 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=34f1a993e3cb99a:51d89bf800000000
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| egu*** ****** ***** *** the                                                                                                                                  |
| ly ***** ************* ***** **** old                                                                                                                        |
| rio***** ******** ******* *** dep                                                                                                                            |
| lit*** ******** **** **n de                                                                                                                                  |
|  pe***** ****** ***** **y re                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

 

I'd recommend doing it with builtin functions since it will be easier to maintain. I included an example above of how you might do it using regexp_replace. I expect it will be quite fast - Impala's query compilation can inline functions like length() and substring() so those are essentially free in Impala (unlike many other SQL engines). The main cost is regexp_replace() but I'd expect that to be quite fast too.

 

avatar

If you want to implement a C++ UDF though, I'd recommend starting with the docs here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html. There are some examples of string manipulation UDFs on that page.

avatar
Explorer

this has been resolve, UDFs are the best.

I wrote java and C++ codes, they check the current user. if user is not classified, data is masked for classified users it appear as is.

--final steps is to create UFS eg hrmask

--create a view with sensitive column preffixed with mask function.

[localhost.localdomain:21000] > create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact;
Query: create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact

Fetched 0 row(s) in 0.41s
[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+------------------+---------------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+------------------+---------------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 5163458320525980 | 6345832052598 |
+-------+-----------+---------+------------------+---------------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 4.74s
[localhost.localdomain:21000] >

 

results are clear as i logged as admin, but once log as other user, sensitive columns are masked.

[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+----------+----------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+----------+----------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 99999999 | 99999999 |
+-------+-----------+---------+----------+----------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 0.15s
[localhost.localdomain:21000] >

 

I wish if this could be a built in function coming with sentry by default.