Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.

masking UFD function for impala

masking UFD function for impala

New Contributor

hi 

 

Need an effieciant impala function to mask/hide some charectors. java/hive is working fine with below.

for (int counter = 4; counter < len - 5; counter++)
if (input[counter] != ' ')
input[counter] = '*';
} else {
for (int counter = 0; counter < len; counter++) {
if (input[counter] != ' ')
input[counter] = '*';
}
}

so i need help with c++/impala

 

Regards

 

3 REPLIES 3

Re: masking UFD function for impala

Master Collaborator

 

[localhost:21000] > select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5;
Query: select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5
Query submitted at: 2017-09-25 07:38:09 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=34f1a993e3cb99a:51d89bf800000000
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| egu*** ****** ***** *** the                                                                                                                                  |
| ly ***** ************* ***** **** old                                                                                                                        |
| rio***** ******** ******* *** dep                                                                                                                            |
| lit*** ******** **** **n de                                                                                                                                  |
|  pe***** ****** ***** **y re                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

 

I'd recommend doing it with builtin functions since it will be easier to maintain. I included an example above of how you might do it using regexp_replace. I expect it will be quite fast - Impala's query compilation can inline functions like length() and substring() so those are essentially free in Impala (unlike many other SQL engines). The main cost is regexp_replace() but I'd expect that to be quite fast too.

 

Re: masking UFD function for impala

Master Collaborator

If you want to implement a C++ UDF though, I'd recommend starting with the docs here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html. There are some examples of string manipulation UDFs on that page.

Highlighted

Re: masking UFD function for impala

Explorer

this has been resolve, UDFs are the best.

I wrote java and C++ codes, they check the current user. if user is not classified, data is masked for classified users it appear as is.

--final steps is to create UFS eg hrmask

--create a view with sensitive column preffixed with mask function.

[localhost.localdomain:21000] > create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact;
Query: create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact

Fetched 0 row(s) in 0.41s
[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+------------------+---------------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+------------------+---------------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 5163458320525980 | 6345832052598 |
+-------+-----------+---------+------------------+---------------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 4.74s
[localhost.localdomain:21000] >

 

results are clear as i logged as admin, but once log as other user, sensitive columns are masked.

[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+----------+----------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+----------+----------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 99999999 | 99999999 |
+-------+-----------+---------+----------+----------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 0.15s
[localhost.localdomain:21000] >

 

I wish if this could be a built in function coming with sentry by default.