Reply
New Contributor
Posts: 2
Registered: ‎09-19-2017

masking UFD function for impala

hi 

 

Need an effieciant impala function to mask/hide some charectors. java/hive is working fine with below.

for (int counter = 4; counter < len - 5; counter++)
if (input[counter] != ' ')
input[counter] = '*';
} else {
for (int counter = 0; counter < len; counter++) {
if (input[counter] != ' ')
input[counter] = '*';
}
}

so i need help with c++/impala

 

Regards

 

Cloudera Employee
Posts: 416
Registered: ‎07-29-2015

Re: masking UFD function for impala

 

[localhost:21000] > select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5;
Query: select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5
Query submitted at: 2017-09-25 07:38:09 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=34f1a993e3cb99a:51d89bf800000000
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| egu*** ****** ***** *** the                                                                                                                                  |
| ly ***** ************* ***** **** old                                                                                                                        |
| rio***** ******** ******* *** dep                                                                                                                            |
| lit*** ******** **** **n de                                                                                                                                  |
|  pe***** ****** ***** **y re                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

 

I'd recommend doing it with builtin functions since it will be easier to maintain. I included an example above of how you might do it using regexp_replace. I expect it will be quite fast - Impala's query compilation can inline functions like length() and substring() so those are essentially free in Impala (unlike many other SQL engines). The main cost is regexp_replace() but I'd expect that to be quite fast too.

 

Cloudera Employee
Posts: 416
Registered: ‎07-29-2015

Re: masking UFD function for impala

If you want to implement a C++ UDF though, I'd recommend starting with the docs here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html. There are some examples of string manipulation UDFs on that page.

Explorer
Posts: 6
Registered: ‎09-12-2017

Re: masking UFD function for impala

this has been resolve, UDFs are the best.

I wrote java and C++ codes, they check the current user. if user is not classified, data is masked for classified users it appear as is.

--final steps is to create UFS eg hrmask

--create a view with sensitive column preffixed with mask function.

[localhost.localdomain:21000] > create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact;
Query: create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact

Fetched 0 row(s) in 0.41s
[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+------------------+---------------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+------------------+---------------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 5163458320525980 | 6345832052598 |
+-------+-----------+---------+------------------+---------------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 4.74s
[localhost.localdomain:21000] >

 

results are clear as i logged as admin, but once log as other user, sensitive columns are masked.

[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+----------+----------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+----------+----------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 99999999 | 99999999 |
+-------+-----------+---------+----------+----------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber

Fetched 2 row(s) in 0.15s
[localhost.localdomain:21000] >

 

I wish if this could be a built in function coming with sentry by default.